---
language:
- en
license: mit
tags:
- code-generation
- transformer
- ast
- cfg
- langchain
- ollama
model_name: MiniCoderX
datasets:
- the-stack
- codesearchnet
- humaneval
- mbpp
- bugs2fix
- java-python
pipeline_tag: text-generation
---
# MiniCoderX: A Lightweight Transformer for Code Generation
**MiniCoderX** is a structure-aware, transformer-based small language model (SLM) for code generation. It blends modern architectural techniques with efficient deployment using tools like **LangChain** and **Ollama**, making it ideal for rapid local experimentation.
---
## Features
- Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
- AST/CFG-aware encoding for code structure understanding
- Syntax-constrained decoding using grammar rules and trees
- Multi-task heads: generation, summarization, translation, bug fixing
- LangChain + Ollama integration for fast local deployment
- Evaluated on HumanEval, CodeXGLUE, MBPP
---
## Model Architecture
| Component | Description |
|----------------|-----------------------------------------------------------|
| Base | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5) |
| Structure-aware | AST and Control Flow Graph embeddings + positional masks |
| Heads | Multi-task heads for flexible downstream use |
| Decoder | Syntax-aware beam search (grammar constraints) |
| Tokenizer | BPE or SentencePiece trained on code + comments |
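As a rough sketch, a tiny encoder-decoder of this kind can be instantiated with Hugging Face Transformers. The hyperparameters below are illustrative assumptions, not the final MiniCoderX configuration:

```python
from transformers import T5Config, T5ForConditionalGeneration

# Illustrative "tiny" encoder-decoder config; all dimensions are assumptions.
config = T5Config(
    vocab_size=32_000,       # BPE/SentencePiece vocab trained on code + comments
    d_model=256,
    d_ff=1024,
    num_layers=4,            # encoder layers
    num_decoder_layers=4,
    num_heads=4,
)
model = T5ForConditionalGeneration(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")
```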
---
## Architectural Additions (SOTA Techniques)
### AST/CFG Embeddings
Enhances understanding of code structure by:
- Adding AST node/edge embeddings to token inputs
- Including path embeddings between syntactic elements
- Graph-aware position encoding
Inspired by: **StructCoder**, **AST-T5**, **Code4Struct**
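A minimal sketch of how structure-aware inputs could be formed, assuming each code token is tagged with the type of its enclosing AST node (the module and vocabulary sizes below are hypothetical):

```python
import torch
import torch.nn as nn

class StructureAwareEmbedding(nn.Module):
    """Sums token, AST-node-type, and position embeddings (illustrative sketch)."""

    def __init__(self, vocab_size: int, num_node_types: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.ast = nn.Embedding(num_node_types, d_model)  # e.g. FunctionDef, If, Call, ...
        self.pos = nn.Embedding(max_len, d_model)         # graph-aware positions could replace this

    def forward(self, token_ids, node_type_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.ast(node_type_ids) + self.pos(positions)

# token_ids and node_type_ids are aligned: one AST node type per code token
emb = StructureAwareEmbedding(vocab_size=32_000, num_node_types=64, d_model=256)
x = emb(torch.randint(0, 32_000, (1, 8)), torch.randint(0, 64, (1, 8)))
print(x.shape)  # torch.Size([1, 8, 256])
```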
### Syntax-Constrained Decoding
Improves generation accuracy and reduces invalid code by:
- Restricting token outputs using grammar constraints (BNF/PEG)
- Custom decoding logic (e.g., Tree traversal)
- Dynamic decoding masks based on token state
Inspired by: **TreeGen**, **Code4Struct**
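One way to realize this with Hugging Face generation utilities is a custom `LogitsProcessor` that masks vocabulary entries the grammar does not allow in the current state. The `allowed_token_ids` callback below is a hypothetical placeholder for a real BNF/PEG grammar tracker:

```python
import torch
from transformers import LogitsProcessor

class GrammarConstrainedLogits(LogitsProcessor):
    """Masks out tokens the grammar state does not permit (sketch)."""

    def __init__(self, allowed_token_ids):
        # allowed_token_ids: callable mapping the generated prefix -> iterable of legal next-token ids
        self.allowed_token_ids = allowed_token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        mask = torch.full_like(scores, float("-inf"))
        for i, prefix in enumerate(input_ids):
            legal = list(self.allowed_token_ids(prefix.tolist()))
            mask[i, legal] = 0.0
        return scores + mask

# Usage with a real model:
# model.generate(**inputs, logits_processor=[GrammarConstrainedLogits(grammar_tracker)])
```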
### Multi-Task Learning Heads
Supports multiple tasks:
- Code generation (NL → Code)
- Summarization (Code → NL)
- Translation (Java ↔ Python)
- Code repair and completion
Inspired by: **CodeT5+**, **CoTexT**
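In CodeT5-style multi-task training, all tasks can share one seq2seq model and are distinguished by a task prefix in the input text. A minimal sketch of that formatting (the prefix strings are assumptions, not the actual training format):

```python
# Hypothetical task prefixes; the real training format may differ.
TASK_PREFIXES = {
    "generate":  "generate python: ",
    "summarize": "summarize: ",
    "translate": "translate java to python: ",
    "repair":    "fix bug: ",
}

def format_example(task: str, source: str) -> str:
    """Prepend a task prefix so a single shared encoder-decoder can route every task."""
    return TASK_PREFIXES[task] + source

print(format_example("summarize", "def add(a, b): return a + b"))
```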
---
## LangChain + Ollama Integration
### Why?
To enable:
- Local testing and chaining of models via **LangChain**
- Fast prototyping with **Ollama** for custom transformer backends
- Easy switching between small local models and larger remote APIs
### Integration Plan
```python
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
# Load MiniCoderX served locally via Ollama
llm = Ollama(model="minicoderx")

# Define the code generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")
print(result)
```
> Ollama serves the fine-tuned SLM locally; LangChain wraps it with prompts, chains, and memory features for interactivity.
---
## Datasets
| Dataset | Use |
|----------------|----------------------------|
| The Stack (subset) | Pretraining corpus |
| CodeSearchNet | Summarization, Search |
| HumanEval | Code generation benchmark |
| MBPP | Python programming prompts |
| Bugs2Fix | Code repair |
| Java-Python | Cross-language translation |
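For illustration, most of these corpora are available through the Hugging Face `datasets` hub. The dataset IDs below are assumptions (check the Hub for exact names and licenses); some, like The Stack, are gated and require authentication:

```python
from datasets import load_dataset

# Dataset IDs are assumptions; verify them on the Hugging Face Hub.
humaneval = load_dataset("openai_humaneval", split="test")
mbpp = load_dataset("mbpp", split="train")
# the_stack = load_dataset("bigcode/the-stack", data_dir="data/python", split="train", streaming=True)

print(humaneval[0]["prompt"][:80])
```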
---
## Training Objectives
- Span masking (CodeT5-style)
- Contrastive pretraining
- Instruction tuning (natural prompt formatting)
- Autoregressive generation
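As a minimal sketch of CodeT5-style span masking: contiguous token spans are replaced by sentinel tokens in the input, and the target reproduces the masked spans. The span selection below is deliberately simplified:

```python
import random

def span_mask(tokens, mask_ratio=0.15, mean_span=3):
    """Replace random contiguous spans with <extra_id_N> sentinels (simplified sketch)."""
    source, target, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        if random.random() < mask_ratio / mean_span:
            span = tokens[i:i + mean_span]
            source.append(f"<extra_id_{sentinel}>")
            target.extend([f"<extra_id_{sentinel}>", *span])
            sentinel += 1
            i += len(span)
        else:
            source.append(tokens[i])
            i += 1
    return source, target

src, tgt = span_mask("def add ( a , b ) : return a + b".split())
print(src, tgt, sep="\n")
```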
---
## Evaluation Benchmarks
| Benchmark | Metric |
|------------|-------------------|
| HumanEval | Pass@1, BLEU |
| MBPP | Accuracy |
| CodeXGLUE | CodeBLEU, EM |
| Unit Tests | Pass Rate |
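Pass@k on HumanEval/MBPP is typically computed with the unbiased estimator from the Codex paper; a small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated per problem, c of them pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))  # expected fraction solved when drawing 1 sample
```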
---
## Project Roadmap
### Phase 1: MVP Model
- Train TinyCodeT5 model with span masking
- Evaluate on MBPP and HumanEval-lite
- Serve via Ollama + LangChain prompt chain
### Phase 2: Structural Learning
- Add AST/CFG encodings
- Introduce grammar-constrained decoding
- Multi-task training (gen, sum, repair)
### Phase 3: Optimization & Packaging
- Distill from larger model (e.g., StarCoder)
- Add reinforcement fine-tuning via test cases
- Export to Hugging Face + Ollama integration
---
## Tools & Frameworks
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [LangChain](https://github.com/langchain-ai/langchain)
- [Ollama](https://ollama.com/)
- SentencePiece / BPE
- NetworkX for building and traversing AST/CFG graphs
---
## Contributing
Want to help with grammar decoders, AST integration, or evaluation? PRs welcome!
---
## License
MIT License. Built for research and open experimentation.
---
## Contact
Open an issue or start a discussion on GitHub!