---
language:
- en
license: mit
tags:
- code-generation
- transformer
- ast
- cfg
- langchain
- ollama
model_name: MiniCoderX
datasets:
- the-stack
- codesearchnet
- humaneval
- mbpp
- bugs2fix
- java-python
pipeline_tag: text-generation
---
# 🚀 MiniCoderX: A Lightweight Transformer for Code Generation
**MiniCoderX** is a structure-aware, transformer-based small language model (SLM) for code generation. It blends modern architectural techniques with efficient deployment using tools like **LangChain** and **Ollama**, making it ideal for rapid local experimentation.
---
## ✨ Features
- 🧠 Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
- 🌲 AST/CFG-aware encoding for code structure understanding
- 💾 Syntax-constrained decoding using grammar rules and trees
- 🔁 Multi-task heads: generation, summarization, translation, bug fixing
- ⚙️ LangChain + Ollama integration for fast local deployment
- 🧪 Evaluated on HumanEval, CodeXGLUE, and MBPP
---
## ๐Ÿ—๏ธ Model Architecture
| Component | Description |
|----------------|-----------------------------------------------------------|
| Base | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5) |
| Structure-aware | AST and Control Flow Graph embeddings + positional masks |
| Heads | Multi-task heads for flexible downstream use |
| Decoder | Syntax-aware beam search (grammar constraints) |
| Tokenizer | BPE or SentencePiece trained on code + comments |
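To give a rough sense of scale, a tiny encoder-decoder along these lines can be sketched with Hugging Face Transformers. The sizes below are illustrative placeholders, not the final MiniCoderX hyperparameters:

```python
from transformers import T5Config, T5ForConditionalGeneration

# Illustrative tiny T5-style configuration; the real MiniCoderX
# hyperparameters are a design decision and may differ.
config = T5Config(
    vocab_size=32_000,       # BPE/SentencePiece vocab trained on code + comments
    d_model=256,             # hidden size
    d_ff=1024,               # feed-forward size
    num_layers=4,            # encoder depth
    num_decoder_layers=4,    # decoder depth
    num_heads=4,             # attention heads
)
model = T5ForConditionalGeneration(config)
print(f"~{model.num_parameters() / 1e6:.1f}M parameters")
```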
---
## 🔧 Architectural Additions (SOTA Techniques)
### 🌲 AST/CFG Embeddings
Enhances understanding of code structure by:
- Adding AST node/edge embeddings to token inputs
- Including path embeddings between syntactic elements
- Using graph-aware position encodings
Inspired by: **StructCoder**, **AST-T5**, **Code4Struct**
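A minimal sketch of the first point, assuming a preprocessing step that tags every token with the id of its enclosing AST node type; the `ast_type_ids` input is that assumed output, not a library API:

```python
import torch
import torch.nn as nn

class StructureAwareEmbedding(nn.Module):
    """Token embeddings enriched with AST node-type embeddings.

    Assumes a preprocessing step that produces one AST node-type id
    per token (the hypothetical `ast_type_ids` tensor below).
    """
    def __init__(self, vocab_size, num_ast_types, d_model):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.ast = nn.Embedding(num_ast_types, d_model)

    def forward(self, input_ids, ast_type_ids):
        # Sum the two embedding streams, analogous to segment embeddings.
        return self.tok(input_ids) + self.ast(ast_type_ids)

emb = StructureAwareEmbedding(vocab_size=32_000, num_ast_types=128, d_model=256)
x = emb(torch.randint(0, 32_000, (1, 16)), torch.randint(0, 128, (1, 16)))
print(x.shape)  # torch.Size([1, 16, 256])
```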
### 💾 Syntax-Constrained Decoding
Improves generation accuracy and reduces invalid code by:
- Restricting token outputs using grammar constraints (BNF/PEG)
- Applying custom decoding logic (e.g., tree traversal)
- Computing dynamic decoding masks based on the current token state
Inspired by: **TreeGen**, **Code4Struct**
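One way to realize dynamic decoding masks is a custom `LogitsProcessor` plugged into Hugging Face `generate()`. The grammar oracle below (`allowed_token_ids`) is a hypothetical callable standing in for a real BNF/PEG parser state machine:

```python
import torch
from transformers import LogitsProcessor

class GrammarConstrainedLogitsProcessor(LogitsProcessor):
    """Keeps only tokens the grammar allows at the current decode step."""

    def __init__(self, allowed_token_ids):
        # allowed_token_ids: hypothetical callable mapping a token-id
        # prefix to the list of grammatically valid next-token ids.
        self.allowed_token_ids = allowed_token_ids

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        for row, prefix in enumerate(input_ids.tolist()):
            mask[row, self.allowed_token_ids(prefix)] = 0.0
        return scores + mask  # disallowed tokens keep -inf logits

# Toy demo: the oracle only ever allows token ids 2 and 7.
proc = GrammarConstrainedLogitsProcessor(lambda prefix: [2, 7])
scores = proc(torch.zeros(1, 5, dtype=torch.long), torch.randn(1, 10))
print(scores.argmax(-1))  # always 2 or 7
```

Passing such a processor via `logits_processor=LogitsProcessorList([...])` prunes invalid programs during beam search instead of filtering them afterwards.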
### 🔁 Multi-Task Learning Heads
Supports multiple tasks:
- Code generation (NL → Code)
- Summarization (Code → NL)
- Translation (Java ⇄ Python)
- Code repair and completion
Inspired by: **CodeT5+**, **CoTexT**
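A sketch of one plausible reading of "multi-task heads": shared decoder states routed to per-task output projections (task-prefix formatting over a single shared head, in the CodeT5 style, is a common alternative):

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Routes shared decoder hidden states to per-task output heads."""

    TASKS = ("generate", "summarize", "translate", "repair")

    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.heads = nn.ModuleDict(
            {task: nn.Linear(d_model, vocab_size) for task in self.TASKS}
        )

    def forward(self, hidden_states, task):
        # hidden_states: (batch, seq_len, d_model) from the shared decoder
        return self.heads[task](hidden_states)  # per-task vocabulary logits

heads = MultiTaskHeads(d_model=256, vocab_size=32_000)
logits = heads(torch.randn(1, 16, 256), task="summarize")
print(logits.shape)  # torch.Size([1, 16, 32000])
```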
---
## ⚡ LangChain + Ollama Integration
### 💡 Why?
To enable:
- 🧪 Local testing and chaining of models via **LangChain**
- 🦮 Fast prototyping with **Ollama** for custom transformer backends
- 🔄 Easy switching between small local models and larger remote APIs
### 🔌 Integration Plan
```python
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX served locally via Ollama
llm = Ollama(model="minicoderx")

# Define the code-generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

# Chain the prompt and model, then run a sample instruction
chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")
print(result)
```
> ✅ Ollama will be used to serve your fine-tuned SLM locally
> ✅ LangChain will wrap it with prompts, chains, and memory features for interactivity
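A sketch of that serving step, assuming the fine-tuned checkpoint has already been converted to GGUF; the file name, parameter value, and prompt below are placeholders:

```python
# Hypothetical packaging step: register a GGUF export of the fine-tuned
# model with a local Ollama install, then query it from the CLI.
import pathlib
import subprocess

pathlib.Path("Modelfile").write_text(
    "FROM ./minicoderx.gguf\n"      # placeholder: assumes a prior GGUF conversion
    "PARAMETER temperature 0.2\n"   # low temperature for more deterministic code
)
subprocess.run(["ollama", "create", "minicoderx", "-f", "Modelfile"], check=True)
subprocess.run(
    ["ollama", "run", "minicoderx",
     "Write a Python function that reverses a string."],
    check=True,
)
```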
---
## 📦 Datasets
| Dataset | Use |
|----------------|----------------------------|
| The Stack (subset) | Pretraining corpus |
| CodeSearchNet | Summarization, Search |
| HumanEval | Code generation benchmark |
| MBPP | Python programming prompts |
| Bugs2Fix | Code repair |
| Java-Python | Cross-language translation |
---
## 🔬 Training Objectives
- ✅ Span masking (CodeT5-style)
- ✅ Contrastive pretraining
- ✅ Instruction tuning (natural prompt formatting)
- ✅ Autoregressive generation
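To make the first objective concrete, here is the T5/CodeT5-style span-corruption format written out on a one-line example; sentinel tokens follow the T5 `<extra_id_N>` convention:

```python
# CodeT5-style span corruption, written out by hand for illustration.
original = "def add(a, b): return a + b"

# A contiguous span is replaced by a sentinel in the encoder input...
masked_input = "def add(a, b): <extra_id_0> a + b"

# ...and the decoder learns to emit the masked span between sentinels.
target = "<extra_id_0> return <extra_id_1>"

print(masked_input)
print(target)
```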
---
## 📊 Evaluation Benchmarks
| Benchmark | Metric |
|------------|-------------------|
| HumanEval | Pass@1, BLEU |
| MBPP | Accuracy |
| CodeXGLUE | CodeBLEU, EM |
| Unit Tests | Pass Rate |
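Pass@1 is typically computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); a small reference implementation:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of which pass the tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=20, c=3, k=1))  # 0.15
```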
---
## 🧪 Project Roadmap
### ✅ Phase 1: MVP Model
- Train TinyCodeT5 model with span masking
- Evaluate on MBPP and HumanEval-lite
- Serve via Ollama + LangChain prompt chain
### 🔍 Phase 2: Structural Learning
- Add AST/CFG encodings
- Introduce grammar-constrained decoding
- Multi-task training (gen, sum, repair)
### 📦 Phase 3: Optimization & Packaging
- Distill from larger model (e.g., StarCoder)
- Add reinforcement fine-tuning via test cases
- Export to Hugging Face + Ollama integration
---
## 🛠️ Tools & Frameworks
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [LangChain](https://github.com/langchain-ai/langchain)
- [Ollama](https://ollama.com/)
- SentencePiece / BPE
- NetworkX for AST/CFG parsing (see the sketch below)
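As a taste of that last item, a minimal sketch that parses Python source with the standard-library `ast` module and materializes the tree as a NetworkX digraph (CFG extraction would need additional logic on top):

```python
import ast
import networkx as nx

def ast_to_graph(source: str) -> nx.DiGraph:
    """Parse Python source and build a directed AST graph with NetworkX."""
    tree = ast.parse(source)
    graph = nx.DiGraph()
    for parent in ast.walk(tree):
        # id() gives a process-local node key; the label keeps the node type.
        graph.add_node(id(parent), label=type(parent).__name__)
        for child in ast.iter_child_nodes(parent):
            graph.add_node(id(child), label=type(child).__name__)
            graph.add_edge(id(parent), id(child))
    return graph

g = ast_to_graph("def add(a, b):\n    return a + b")
print(g.number_of_nodes(), g.number_of_edges())
```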
---
## 🤝 Contributing
Want to help with grammar decoders, AST integration, or evaluation? PRs welcome!
---
## 📜 License
MIT License. Built for research and open experimentation.
---
## 📧 Contact
Open an issue or start a discussion on GitHub!