---
language:
- en
license: mit
tags:
- code-generation
- transformer
- ast
- cfg
- langchain
- ollama
model_name: MiniCoderX
datasets:
- the-stack
- codesearchnet
- humaneval
- mbpp
- bugs2fix
- java-python
pipeline_tag: text-generation
---

# 🚀 MiniCoderX: A Lightweight Transformer for Code Generation

**MiniCoderX** is a structure-aware, transformer-based small language model (SLM) for code generation. It blends modern architectural techniques with efficient deployment through tools like **LangChain** and **Ollama**, making it ideal for rapid local experimentation.

---

## ✨ Features

- 🧠 Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
- 🌲 AST/CFG-aware encoding for code structure understanding
- 💾 Syntax-constrained decoding using grammar rules and trees
- 🔁 Multi-task heads: generation, summarization, translation, bug fixing
- ⚙️ LangChain + Ollama integration for fast local deployment
- 🧪 Evaluated on HumanEval, CodeXGLUE, MBPP

---

## 🏗️ Model Architecture

| Component       | Description                                               |
|-----------------|-----------------------------------------------------------|
| Base            | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5)     |
| Structure-aware | AST and control-flow-graph embeddings + positional masks  |
| Heads           | Multi-task heads for flexible downstream use              |
| Decoder         | Syntax-aware beam search (grammar constraints)            |
| Tokenizer       | BPE or SentencePiece trained on code + comments           |

---

## 🔧 Architectural Additions (SOTA Techniques)

### 🌲 AST/CFG Embeddings

Enhances understanding of code structure by:

- Adding AST node/edge embeddings to token inputs
- Including path embeddings between syntactic elements
- Graph-aware position encoding

A minimal feature-extraction sketch appears below, after the decoding example.

Inspired by: **StructCoder**, **AST-T5**, **Code4Struct**

### 💾 Syntax-Constrained Decoding

Improves generation accuracy and reduces invalid code by:

- Restricting token outputs using grammar constraints (BNF/PEG)
- Custom decoding logic (e.g., tree traversal)
- Dynamic decoding masks based on token state

See the sketch just below.

Inspired by: **TreeGen**, **Code4Struct**
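To make the idea concrete, here is a minimal, self-contained sketch of grammar-masked decoding. It is illustrative only, not the MiniCoderX decoder: the toy vocabulary and hand-written state table stand in for a real BNF/PEG grammar, and an additive `-inf` mask keeps illegal tokens out of the argmax.

```python
import numpy as np

# Toy vocabulary; a real decoder would mask the model's full BPE vocab.
VOCAB = ["def", "name", "(", ")", ":", "body", "<eos>"]

# Hand-written state table standing in for a BNF/PEG grammar:
# current state -> tokens that keep the output syntactically valid.
GRAMMAR = {
    "start": {"def"},
    "def":   {"name"},
    "name":  {"("},
    "(":     {")"},
    ")":     {":"},
    ":":     {"body"},
    "body":  {"body", "<eos>"},
}

def constrained_step(logits: np.ndarray, state: str) -> int:
    """Mask the logits so only grammar-legal tokens survive, then pick one."""
    mask = np.full_like(logits, -np.inf)
    for tok in GRAMMAR[state]:
        mask[VOCAB.index(tok)] = 0.0
    return int(np.argmax(logits + mask))  # greedy here; beam search masks the same way

# Simulate decoding with random logits: output is well-formed by construction.
rng = np.random.default_rng(0)
state, out = "start", []
while state != "<eos>":
    state = VOCAB[constrained_step(rng.normal(size=len(VOCAB)), state)]
    out.append(state)
print(" ".join(out))  # e.g. "def name ( ) : body <eos>"
```

In a full implementation the state would be derived from an incremental parse of the partial output, and the mask would be applied inside the beam-search scoring loop.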
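The AST side described earlier can be prototyped just as cheaply. The sketch below is again illustrative (the function name and feature format are assumptions, not released code): it uses Python's built-in `ast` module to collect the node types and parent-child edges that would feed the node/edge embeddings and graph-aware position encoding.

```python
import ast

def extract_ast_features(source: str):
    """Collect AST node types and parent->child edges from Python source.

    Node-type names would be mapped to extra embedding IDs, and the
    edge list would drive a graph-aware position encoding.
    """
    tree = ast.parse(source)
    node_ids = {}    # AST node -> integer index
    node_types = []  # node-type name per index
    edges = []       # (parent_index, child_index) pairs

    # First pass: assign an index to every node in BFS order.
    for node in ast.walk(tree):
        node_ids[node] = len(node_types)
        node_types.append(type(node).__name__)

    # Second pass: record parent->child structure.
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append((node_ids[node], node_ids[child]))

    return node_types, edges

types, edges = extract_ast_features("def add(a, b):\n    return a + b")
print(types[:4])  # ['Module', 'FunctionDef', 'arguments', 'Return']
print(edges[:3])  # e.g. [(0, 1), (1, 2), (1, 3)]
```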
### 🔁 Multi-Task Learning Heads

Supports multiple tasks:

- Code generation (NL → Code)
- Summarization (Code → NL)
- Translation (Java ⇄ Python)
- Code repair and completion

Inspired by: **CodeT5+**, **CoTexT**

---

## ⚡ LangChain + Ollama Integration

### 💡 Why?

To enable:

- 🧪 Local testing and chaining of models via **LangChain**
- 🦮 Fast prototyping with **Ollama** for custom transformer backends
- 🔄 Easy switching between small local models and larger remote APIs

### 🔌 Integration Plan

```python
from langchain.llms import Ollama  # on newer versions: from langchain_community.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX served locally by Ollama
llm = Ollama(model="minicoderx")

# Define the code-generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")
print(result)
```

> ✅ Ollama will be used to serve the fine-tuned SLM locally
> ✅ LangChain will wrap it with prompts, chains, and memory features for interactivity

---

## 📦 Datasets

| Dataset            | Use                         |
|--------------------|-----------------------------|
| The Stack (subset) | Pretraining corpus          |
| CodeSearchNet      | Summarization, search       |
| HumanEval          | Code generation benchmark   |
| MBPP               | Python programming prompts  |
| Bugs2Fix           | Code repair                 |
| Java-Python        | Cross-language translation  |

---

## 🔬 Training Objectives

- ✅ Span masking (CodeT5-style)
- ✅ Contrastive pretraining
- ✅ Instruction tuning (natural prompt formatting)
- ✅ Autoregressive generation

---

## 📊 Evaluation Benchmarks

| Benchmark  | Metric        |
|------------|---------------|
| HumanEval  | Pass@1, BLEU  |
| MBPP       | Accuracy      |
| CodeXGLUE  | CodeBLEU, EM  |
| Unit tests | Pass rate     |

---

## 🧪 Project Roadmap

### ✅ Phase 1: MVP Model

- Train a TinyCodeT5 model with span masking
- Evaluate on MBPP and HumanEval-lite
- Serve via Ollama + a LangChain prompt chain

### 🔁 Phase 2: Structural Learning

- Add AST/CFG encodings
- Introduce grammar-constrained decoding
- Multi-task training (generation, summarization, repair)

### 📦 Phase 3: Optimization & Packaging

- Distill from a larger model (e.g., StarCoder)
- Add reinforcement fine-tuning driven by test cases
- Export to Hugging Face + Ollama integration

---

## 🛠️ Tools & Frameworks

- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [LangChain](https://github.com/langchain-ai/langchain)
- [Ollama](https://ollama.com/)
- SentencePiece / BPE tokenizers
- NetworkX for AST/CFG graph representation

---

## 🤝 Contributing

Want to help with grammar decoders, AST integration, or evaluation? PRs welcome!

---

## 📜 License

MIT License. Built for research and open experimentation.

---

## 📧 Contact

Drop an issue or discussion on GitHub!