# MiniCoderX: A Lightweight Transformer for Code Generation

**MiniCoderX** is a structure-aware, transformer-based small language model (SLM) for code generation. It blends modern architectural techniques with efficient deployment using tools like **LangChain** and **Ollama**, making it ideal for rapid local experimentation.

---

## Features

- Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
- AST/CFG-aware encoding for code structure understanding
- Syntax-constrained decoding using grammar rules and trees
- Multi-task heads: generation, summarization, translation, bug fixing
- LangChain + Ollama integration for fast local deployment
- Evaluated on HumanEval, CodeXGLUE, MBPP

---

## Model Architecture

| Component       | Description                                              |
|-----------------|----------------------------------------------------------|
| Base            | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5)    |
| Structure-aware | AST and control-flow-graph embeddings + positional masks |
| Heads           | Multi-task heads for flexible downstream use             |
| Decoder         | Syntax-aware beam search (grammar constraints)           |
| Tokenizer       | BPE or SentencePiece trained on code + comments          |
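
To make the tokenizer row concrete, here is the core BPE training step in miniature — a teaching sketch, not the training code this repo would ship: count adjacent symbol pairs across the corpus and merge the most frequent one, repeatedly.

```python
from collections import Counter

def most_frequent_pair(words: list[list[str]]) -> tuple[str, str]:
    """Count adjacent symbol pairs across all words; return the top pair."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return max(pairs, key=pairs.get)

def merge(words: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    """Replace every occurrence of `pair` with its concatenation."""
    out = []
    for w in words:
        i, new = 0, []
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                new.append(w[i] + w[i + 1])
                i += 2
            else:
                new.append(w[i])
                i += 1
        out.append(new)
    return out

words = [list("low"), list("lower"), list("lowest")]
p = most_frequent_pair(words)   # ('l', 'o'), present in all three words
words = merge(words, p)
print(words[0])  # ['lo', 'w']
```

A real tokenizer (SentencePiece or Hugging Face `tokenizers`) repeats this merge loop until the target vocabulary size is reached.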

---

## Architectural Additions (SOTA Techniques)

### AST/CFG Embeddings
Enhances understanding of code structure by:
- Adding AST node/edge embeddings to token inputs
- Including path embeddings between syntactic elements
- Using graph-aware positional encoding

Inspired by: **StructCoder**, **AST-T5**, **Code4Struct**
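
As a minimal sketch of the first bullet (the helper and table names are illustrative, not this project's API), each AST node can be mapped to a type id that indexes a learned embedding table alongside the token embeddings:

```python
import ast

# Illustrative only: derive a per-node type id from Python's `ast` module;
# these ids could index an embedding table added to the token embeddings.
NODE_TYPES: dict[str, int] = {}  # node-type name -> integer id

def ast_type_ids(source: str) -> list[int]:
    """Return one type id per AST node, in ast.walk (breadth-first) order."""
    ids = []
    for node in ast.walk(ast.parse(source)):
        name = type(node).__name__  # e.g. "FunctionDef", "Return", "BinOp"
        ids.append(NODE_TYPES.setdefault(name, len(NODE_TYPES)))
    return ids

ids = ast_type_ids("def f(x):\n    return x + 1")
print(ids)  # the first id belongs to the Module node
```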

### Syntax-Constrained Decoding
Improves generation accuracy and reduces invalid code by:
- Restricting token outputs using grammar constraints (BNF/PEG)
- Applying custom decoding logic (e.g., tree traversal)
- Computing dynamic decoding masks from the current token state

Inspired by: **TreeGen**, **Code4Struct**
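
A toy illustration of the dynamic-mask idea (the four-token vocabulary and the balanced-bracket "grammar" are invented for the example): at each step, vocabulary entries that would break the grammar are masked out before sampling.

```python
VOCAB = ["(", ")", "x", "<eos>"]

def legal_mask(prefix: list[str]) -> list[bool]:
    """Per-token mask over VOCAB: True if the token keeps the prefix valid."""
    depth = prefix.count("(") - prefix.count(")")
    mask = []
    for tok in VOCAB:
        if tok == ")":
            mask.append(depth > 0)    # cannot close an unopened bracket
        elif tok == "<eos>":
            mask.append(depth == 0)   # may only stop once brackets balance
        else:
            mask.append(True)
    return mask

# A real decoder would set the logits of masked tokens to -inf before softmax.
print(legal_mask(["(", "x"]))  # [True, True, True, False]
```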

### Multi-Task Learning Heads
Supports multiple tasks:
- Code generation (NL → Code)
- Summarization (Code → NL)
- Translation (Java → Python)
- Code repair and completion

Inspired by: **CodeT5+**, **CoTexT**
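
One common way to drive several tasks through a single encoder-decoder, CodeT5-style, is a task prefix on the input (the tag strings below are assumptions for illustration, not identifiers this repo defines):

```python
# Hypothetical task tags; a shared model routes on the prefix.
TASK_TAGS = {
    "generate": "<nl2code>",
    "summarize": "<code2nl>",
    "translate": "<java2py>",
    "repair": "<fix>",
}

def format_input(task: str, text: str) -> str:
    """Prepend the task tag so one shared model can serve every task."""
    return f"{TASK_TAGS[task]} {text}"

print(format_input("generate", "sort a list of integers"))
# <nl2code> sort a list of integers
```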

---

## LangChain + Ollama Integration

### Why?
To enable:
- Local testing and chaining of models via **LangChain**
- Fast prototyping with **Ollama** for custom transformer backends
- Easy switching between small local models and larger remote APIs

### Integration Plan
```python
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX served locally via Ollama
llm = Ollama(model="minicoderx")

# Define the code generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")

print(result)
```

> Ollama will be used to serve your fine-tuned SLM locally.
> LangChain will wrap it with prompts, chains, and memory features for interactivity.

---

## Datasets

| Dataset            | Use                        |
|--------------------|----------------------------|
| The Stack (subset) | Pretraining corpus         |
| CodeSearchNet      | Summarization, search      |
| HumanEval          | Code generation benchmark  |
| MBPP               | Python programming prompts |
| Bugs2Fix           | Code repair                |
| Java-Python        | Cross-language translation |

---

## Training Objectives

- Span masking (CodeT5-style)
- Contrastive pretraining
- Instruction tuning (natural prompt formatting)
- Autoregressive generation
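
The first objective can be sketched in a few lines (the sentinel format follows the T5 convention; the span length and masking rate here are illustrative, not tuned values):

```python
import random

# Minimal sketch of CodeT5-style span masking: contiguous spans are replaced
# by sentinel tokens, and the (sentinel, span) pairs become the decode target.
def mask_spans(tokens, span_len=2, rate=0.3, seed=0):
    rng = random.Random(seed)
    out, targets, i, sid = [], [], 0, 0
    while i < len(tokens):
        if rng.random() < rate / span_len:   # masks ~`rate` of tokens overall
            sentinel = f"<extra_id_{sid}>"
            targets.append((sentinel, tokens[i:i + span_len]))
            out.append(sentinel)
            i += span_len
            sid += 1
        else:
            out.append(tokens[i])
            i += 1
    return out, targets

masked, targets = mask_spans("def add ( a , b ) : return a + b".split())
print(masked)
```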

---

## Evaluation Benchmarks

| Benchmark  | Metric       |
|------------|--------------|
| HumanEval  | Pass@1, BLEU |
| MBPP       | Accuracy     |
| CodeXGLUE  | CodeBLEU, EM |
| Unit tests | Pass rate    |
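
Pass@1 is usually reported with the unbiased estimator from the HumanEval paper: with n samples per problem, c of which pass the tests, pass@k = 1 − C(n−c, k)/C(n, k).

```python
from math import comb

# Unbiased pass@k: draw k of the n samples; the chance that none of them
# is correct is C(n-c, k) / C(n, k).
def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples per problem, c of them passing the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```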

---

## Project Roadmap

### Phase 1: MVP Model
- Train a TinyCodeT5 model with span masking
- Evaluate on MBPP and HumanEval-lite
- Serve via Ollama + LangChain prompt chain
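
Serving through Ollama would start from a Modelfile. This is a sketch: the GGUF filename and parameter values are assumptions about a future export, not shipped artifacts.

```
# Modelfile (hypothetical) — assumes MiniCoderX has been exported to GGUF
FROM ./minicoderx.gguf
PARAMETER temperature 0.2
SYSTEM "You are MiniCoderX, a code generation assistant."
```

Running `ollama create minicoderx -f Modelfile` would then register the `minicoderx` model name used in the LangChain example.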

### Phase 2: Structural Learning
- Add AST/CFG encodings
- Introduce grammar-constrained decoding
- Multi-task training (generation, summarization, repair)

### Phase 3: Optimization & Packaging
- Distill from a larger model (e.g., StarCoder)
- Add reinforcement fine-tuning driven by test cases
- Export to Hugging Face + Ollama integration

---

## Tools & Frameworks

- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [LangChain](https://github.com/langchain-ai/langchain)
- [Ollama](https://ollama.com/)
- SentencePiece / BPE
- NetworkX for AST/CFG graph handling
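
For the last item, a sketch of how AST structure could reach NetworkX (the helper name is invented): extract parent→child edges with the standard `ast` module; the resulting edge list is exactly what `networkx.DiGraph` accepts.

```python
import ast

# Illustrative helper: parent -> child edges over AST node-type names.
def ast_edges(source: str) -> list[tuple[str, str]]:
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges

print(ast_edges("x = 1"))
# [('Module', 'Assign'), ('Assign', 'Name'), ('Assign', 'Constant')]
```

`networkx.DiGraph(ast_edges(src))` then exposes paths, degrees, and traversals for the structure-aware encoder.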

---

## Contributing

Want to help with grammar decoders, AST integration, or evaluation? PRs welcome!

---

## License

MIT License. Built for research and open experimentation.

---

## Contact

Open an issue or start a discussion on GitHub!