|
--- |
|
language: |
|
- en |
|
license: mit |
|
tags: |
|
- code-generation |
|
- transformer |
|
- ast |
|
- cfg |
|
- langchain |
|
- ollama |
|
model_name: MiniCoderX |
|
datasets: |
|
- the-stack |
|
- codesearchnet |
|
- humaneval |
|
- mbpp |
|
- bugs2fix |
|
- java-python |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# MiniCoderX: A Lightweight Transformer for Code Generation
|
|
|
**MiniCoderX** is a structure-aware, transformer-based small language model (SLM) for code generation. It pairs modern architectural techniques (AST/CFG-aware encoding, grammar-constrained decoding) with lightweight local deployment through **LangChain** and **Ollama**, making it well suited to rapid local experimentation.
|
|
|
--- |
|
|
|
## Features
|
|
|
- Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)

- AST/CFG-aware encoding for code structure understanding

- Syntax-constrained decoding using grammar rules and trees

- Multi-task heads: generation, summarization, translation, bug fixing

- LangChain + Ollama integration for fast local deployment

- Evaluated on HumanEval, CodeXGLUE, MBPP
|
|
|
--- |
|
|
|
## Model Architecture
|
|
|
| Component | Description | |
|
|----------------|-----------------------------------------------------------| |
|
| Base | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5) | |
|
| Structure-aware | AST and Control Flow Graph embeddings + positional masks | |
|
| Heads | Multi-task heads for flexible downstream use | |
|
| Decoder | Syntax-aware beam search (grammar constraints) | |
|
| Tokenizer | BPE or SentencePiece trained on code + comments | |
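
For a concrete sense of scale, here is a minimal sketch of instantiating a tiny encoder-decoder with Hugging Face Transformers. The hyperparameters are illustrative assumptions, not a released MiniCoderX configuration:

```python
# Illustrative sketch: sizes are assumptions, not MiniCoderX's actual config.
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    vocab_size=32_000,     # BPE/SentencePiece vocabulary over code + comments
    d_model=256,           # small hidden size keeps the model lightweight
    num_layers=4,          # encoder depth
    num_decoder_layers=4,  # decoder depth
    num_heads=4,
    d_ff=1024,
)
model = T5ForConditionalGeneration(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```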
|
|
|
--- |
|
|
|
## Architectural Additions (SOTA Techniques)
|
|
|
### AST/CFG Embeddings
|
Enhances understanding of code structure by:

- Adding AST node/edge embeddings to token inputs

- Including path embeddings between syntactic elements

- Using graph-aware position encodings (see the sketch below)
|
|
|
Inspired by: **StructCoder**, **AST-T5**, **Code4Struct** |
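
As a minimal PyTorch sketch of the idea, the module below fuses token, AST-node-type, and graph-depth signals by summation. The id spaces, sizes, and names are illustrative assumptions, not the model's actual implementation:

```python
import torch
import torch.nn as nn

class StructureAwareEmbedding(nn.Module):
    """Token embeddings enriched with AST/CFG structure (illustrative sketch)."""

    def __init__(self, vocab_size=32_000, num_node_types=200, max_depth=64, d_model=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)             # regular token embedding
        self.node_type = nn.Embedding(num_node_types, d_model)   # AST node kind per token
        self.depth = nn.Embedding(max_depth, d_model)            # graph-aware position: depth in the AST

    def forward(self, token_ids, node_type_ids, depth_ids):
        # Summing learned embeddings is the standard fusion scheme
        return self.tok(token_ids) + self.node_type(node_type_ids) + self.depth(depth_ids)

emb = StructureAwareEmbedding()
out = emb(torch.tensor([[5, 9]]), torch.tensor([[3, 7]]), torch.tensor([[0, 1]]))
print(out.shape)  # torch.Size([1, 2, 256])
```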
|
|
|
### Syntax-Constrained Decoding
|
Improves generation accuracy and reduces invalid code by:

- Restricting token outputs with grammar constraints (BNF/PEG)

- Applying custom decoding logic (e.g., tree traversal)

- Building dynamic decoding masks from the current token state (see the sketch below)
|
|
|
Inspired by: **TreeGen**, **Code4Struct** |
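
A toy sketch of dynamic decoding masks: at each step, every token the grammar rules out is masked to negative infinity before sampling. The `allowed_ids` input stands in for a real BNF/PEG validity check:

```python
import torch

def constrain_logits(logits, allowed_ids):
    """Keep only grammar-legal next tokens (sketch).

    `allowed_ids` would be produced by a BNF/PEG parser tracking the
    current syntactic state; here it is supplied by hand."""
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed_ids] = 0.0
    return logits + mask

# Toy example: the grammar state permits only tokens 2, 5, and 7 next.
logits = torch.randn(1, 10)
constrained = constrain_logits(logits, allowed_ids=[2, 5, 7])
print(constrained.argmax(dim=-1))  # guaranteed to be one of the allowed ids
```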
|
|
|
### Multi-Task Learning Heads
|
Supports multiple tasks from a single model (task formatting sketched below):

- Code generation (NL → Code)

- Summarization (Code → NL)

- Translation (Java ↔ Python)

- Code repair and completion
|
|
|
Inspired by: **CodeT5+**, **CoTexT** |
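
One common way to drive several tasks through one model, as in CodeT5-style multi-task training, is to prefix each input with a task tag. The tags below are hypothetical, shown only to illustrate the routing:

```python
# Hypothetical task tags; the exact prefixes are an assumption.
TASK_TEMPLATES = {
    "generate":  "generate: {nl}",                    # NL -> Code
    "summarize": "summarize: {code}",                 # Code -> NL
    "translate": "translate java to python: {code}",  # Java <-> Python
    "repair":    "repair: {code}",                    # bug fixing
}

def format_example(task, **fields):
    return TASK_TEMPLATES[task].format(**fields)

print(format_example("generate", nl="reverse a linked list"))
# generate: reverse a linked list
```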
|
|
|
--- |
|
|
|
## LangChain + Ollama Integration
|
|
|
### Why?
|
To enable: |
|
- Local testing and chaining of models via **LangChain**

- Fast prototyping with **Ollama** for custom transformer backends

- Easy switching between small local models and larger remote APIs
|
|
|
### Integration Plan
|
```python
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX served locally via Ollama
llm = Ollama(model="minicoderx")

# Define the code-generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")

print(result)
```
|
|
|
> Ollama will be used to serve your fine-tuned SLM locally

> LangChain will wrap it with prompts, chains, and memory features for interactivity
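
One plausible way to serve a fine-tuned checkpoint through Ollama is to export it (e.g., to GGUF) and register it with a Modelfile; the filename and parameters below are assumptions for illustration:

```
# Modelfile (sketch): the GGUF export path and settings are assumptions.
FROM ./minicoderx.gguf
PARAMETER temperature 0.2
SYSTEM You are MiniCoderX, a small code-generation assistant.
```

Then `ollama create minicoderx -f Modelfile` makes the `minicoderx` model name used in the LangChain snippet above resolvable locally.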
|
|
|
--- |
|
|
|
## Datasets
|
|
|
| Dataset | Use | |
|
|----------------|----------------------------| |
|
| The Stack (subset) | Pretraining corpus | |
|
| CodeSearchNet | Summarization, Search | |
|
| HumanEval | Code generation benchmark | |
|
| MBPP | Python programming prompts | |
|
| Bugs2Fix | Code repair | |
|
| Java-Python | Cross-language translation | |
|
|
|
--- |
|
|
|
## Training Objectives
|
|
|
- Span Masking (CodeT5-style)

- Contrastive pretraining

- Instruction tuning (natural prompt formatting)

- Auto-regressive generation
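
For the first objective, here is a simplified sketch of CodeT5-style span masking: contiguous spans are replaced with sentinel tokens in the input, and the target reconstructs them. Sentinel naming follows the T5 `<extra_id_N>` convention; the rates are illustrative:

```python
import random

def span_mask(tokens, mask_rate=0.15, mean_span=3):
    """CodeT5/T5-style span corruption (simplified sketch)."""
    inp, tgt, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        # Start a masked span with probability mask_rate / mean_span,
        # so roughly mask_rate of all tokens end up masked.
        if random.random() < mask_rate / mean_span:
            span = tokens[i:i + mean_span]
            inp.append(f"<extra_id_{sentinel}>")           # sentinel replaces the span
            tgt.extend([f"<extra_id_{sentinel}>", *span])  # target spells the span out
            sentinel += 1
            i += len(span)
        else:
            inp.append(tokens[i])
            i += 1
    return inp, tgt

src = "def add ( a , b ) : return a + b".split()
print(span_mask(src))
```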
|
|
|
--- |
|
|
|
## Evaluation Benchmarks
|
|
|
| Benchmark | Metric | |
|
|------------|-------------------| |
|
| HumanEval | Pass@1, BLEU | |
|
| MBPP | Accuracy | |
|
| CodeXGLUE | CodeBLEU, EM | |
|
| Unit Tests | Pass Rate | |
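
For reference, Pass@k is usually computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and average the quantity below over problems:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = samples generated, c = samples that pass the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))  # 0.25
```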
|
|
|
--- |
|
|
|
## Project Roadmap
|
|
|
### Phase 1: MVP Model
|
- Train TinyCodeT5 model with span masking |
|
- Evaluate on MBPP and HumanEval-lite |
|
- Serve via Ollama + LangChain prompt chain |
|
|
|
### Phase 2: Structural Learning
|
- Add AST/CFG encodings |
|
- Introduce grammar-constrained decoding |
|
- Multi-task training (generation, summarization, repair)
|
|
|
### Phase 3: Optimization & Packaging
|
- Distill from larger model (e.g., StarCoder) |
|
- Add reinforcement fine-tuning via test cases |
|
- Export to Hugging Face + Ollama integration |
|
|
|
--- |
|
|
|
## Tools & Frameworks
|
|
|
- [Hugging Face Transformers](https://github.com/huggingface/transformers) |
|
- [LangChain](https://github.com/langchain-ai/langchain) |
|
- [Ollama](https://ollama.com/) |
|
- SentencePiece / BPE |
|
- NetworkX for AST/CFG graph construction and analysis
|
|
|
--- |
|
|
|
## Contributing
|
|
|
Want to help with grammar decoders, AST integration, or evaluation? PRs welcome! |
|
|
|
--- |
|
|
|
## License
|
|
|
MIT License. Built for research and open experimentation. |
|
|
|
--- |
|
|
|
## Contact
|
|
|
Open an issue or start a discussion on GitHub!