sanjudebnath committed on
Commit e2d1124 · verified · 1 Parent(s): 9079429

Update README.md
Files changed (1): README.md (+169 −3)

README.md CHANGED
@@ -1,3 +1,169 @@
- ---
- license: mit
- ---

# 🚀 MiniCoderX: A Lightweight Transformer for Code Generation

**MiniCoderX** is a structure-aware, transformer-based small language model (SLM) for code generation. It blends modern architectural techniques with efficient deployment using tools like **LangChain** and **Ollama**, making it ideal for rapid local experimentation.

---

## ✨ Features

- 🧠 Transformer-based encoder-decoder (TinyCodeT5 / DistilGPT2)
- 🌲 AST/CFG-aware encoding for code structure understanding
- 💾 Syntax-constrained decoding using grammar rules and trees
- 🔁 Multi-task heads: generation, summarization, translation, bug fixing
- ⚙️ LangChain + Ollama integration for fast local deployment
- 🧪 Evaluated on HumanEval, CodeXGLUE, and MBPP

---

## 🏗️ Model Architecture

| Component       | Description                                               |
|-----------------|-----------------------------------------------------------|
| Base            | Tiny encoder-decoder (MiniLM, DistilGPT2, TinyCodeT5)     |
| Structure-aware | AST and control-flow-graph embeddings + positional masks  |
| Heads           | Multi-task heads for flexible downstream use              |
| Decoder         | Syntax-aware beam search (grammar constraints)            |
| Tokenizer       | BPE or SentencePiece trained on code + comments           |

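The tokenizer row above mentions BPE. As a rough illustration only (a toy corpus and a hand-rolled merge loop, not the project's actual tokenizer), the core BPE merge step works like this:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (pre-split into characters) -> frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
for _ in range(2):  # two merge steps: "l"+"o" -> "lo", then "lo"+"w" -> "low"
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
```

A real tokenizer (SentencePiece, Hugging Face `tokenizers`) repeats this until the vocabulary reaches a target size and records the merge order.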
---

## 🔧 Architectural Additions (SOTA Techniques)

### 🌲 AST/CFG Embeddings
Enhances understanding of code structure by:
- Adding AST node/edge embeddings to the token inputs
- Including path embeddings between syntactic elements
- Using graph-aware position encodings

Inspired by: **StructCoder**, **AST-T5**, **Code4Struct**

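The node-embedding idea in the bullets above can be sketched with Python's built-in `ast` module. This is a minimal illustration, not MiniCoderX's API: the helper names and the integer-ID stand-in for a learned embedding table are made up here.

```python
import ast

def ast_node_tags(source):
    """Walk the AST and collect (node_type, depth) pairs -- the raw
    structural signal that node-type embeddings would be built from."""
    tree = ast.parse(source)
    tags = []
    def visit(node, depth):
        tags.append((type(node).__name__, depth))
        for child in ast.iter_child_nodes(node):
            visit(child, depth + 1)
    visit(tree, 0)
    return tags

def node_type_ids(tags):
    """Map each distinct node type to an integer ID -- a stand-in for rows
    of an embedding table that would be summed into the token embeddings."""
    vocab = {}
    return [vocab.setdefault(t, len(vocab)) for t, _ in tags]

tags = ast_node_tags("def f(x):\n    return x + 1")
ids = node_type_ids(tags)
```

In a full model these IDs would index an `nn.Embedding`-style table and be added to (or concatenated with) the token embeddings, with the depth feeding a graph-aware position encoding.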
### 💾 Syntax-Constrained Decoding
Improves generation accuracy and reduces invalid code by:
- Restricting token outputs with grammar constraints (BNF/PEG)
- Applying custom decoding logic (e.g., tree traversal)
- Computing dynamic decoding masks based on the token state

Inspired by: **TreeGen**, **Code4Struct**

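To make the masking idea concrete, here is a toy sketch: a one-rule "grammar" (balanced parentheses) filters the candidate vocabulary before each greedy step. The vocabulary and scoring function are illustrative; a real implementation would mask logits against a full BNF/PEG grammar state.

```python
def allowed_next(tokens, vocab=("(", ")", "x")):
    """Return the subset of the vocabulary that keeps the prefix valid:
    ')' is only allowed while an unmatched '(' is open."""
    depth = tokens.count("(") - tokens.count(")")
    return tuple(t for t in vocab if t != ")" or depth > 0)

def constrained_greedy_decode(scores, max_len=6):
    """Greedy decoding where tokens outside the grammar mask are discarded
    before the argmax. `scores` ranks tokens globally (toy stand-in for
    per-step model logits)."""
    out = []
    for _ in range(max_len):
        candidates = allowed_next(out)
        out.append(max(candidates, key=scores.get))
    # Close any brackets left open so the result is well formed.
    out += [")"] * (out.count("(") - out.count(")"))
    return "".join(out)

# A model that "prefers" ')' would emit invalid output without the mask.
result = constrained_greedy_decode({"(": 1.0, ")": 2.0, "x": 0.5})
```

The same pattern extends to beam search: apply the mask to every beam before expanding it.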
### 🔁 Multi-Task Learning Heads
Supports multiple tasks:
- Code generation (NL → Code)
- Summarization (Code → NL)
- Translation (Java ⇄ Python)
- Code repair and completion

Inspired by: **CodeT5+**, **CoTexT**

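Alongside separate output heads, CodeT5-style models commonly route tasks through a shared model with task prefixes. A hedged sketch of that routing (the prefix strings and helper are illustrative, not MiniCoderX's API):

```python
# Task prefixes let one shared encoder-decoder multiplex tasks (CodeT5-style).
TASK_PREFIXES = {
    "generate": "Generate code: ",
    "summarize": "Summarize code: ",
    "translate_java_python": "Translate Java to Python: ",
    "repair": "Fix bugs in: ",
}

def build_input(task, text):
    """Prepend the task prefix so the model knows which objective applies."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task}")
    return TASK_PREFIXES[task] + text

example = build_input("summarize", "def add(a, b): return a + b")
```

With dedicated heads instead, the same shared encoder output would feed per-task projection layers; the prefix approach keeps a single head and moves the routing into the input.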
---

## ⚡ LangChain + Ollama Integration

### 💡 Why?
To enable:
- 🧪 Local testing and chaining of models via **LangChain**
- 🦮 Fast prototyping with **Ollama** for custom transformer backends
- 🔄 Easy switching between small local models and larger remote APIs

### 🔌 Integration Plan
```python
from langchain.llms import Ollama
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Load MiniCoderX with Ollama
llm = Ollama(model="minicoderx")  # Local model served via Ollama

# Define the code-generation prompt
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="Generate Python code for the task: {instruction}",
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run("Sort a list of integers using quicksort")

print(result)
```

> ✅ Ollama will be used to serve your fine-tuned SLM locally
> ✅ LangChain will wrap it with prompts, chains, and memory features for interactivity

---

## 📦 Datasets

| Dataset            | Use                        |
|--------------------|----------------------------|
| The Stack (subset) | Pretraining corpus         |
| CodeSearchNet      | Summarization, search      |
| HumanEval          | Code generation benchmark  |
| MBPP               | Python programming prompts |
| Bugs2Fix           | Code repair                |
| Java-Python        | Cross-language translation |

---
105
+
106
+ ## ๐Ÿ”ฌ Training Objectives
107
+
108
+ - โœ… Span Masking (CodeT5-style)
109
+ - โœ… Contrastive pretraining
110
+ - โœ… Instruction tuning (natural prompt formatting)
111
+ - โœ… Auto-regressive generation
112
+
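The span-masking objective above can be sketched as T5/CodeT5-style span corruption: chosen spans are replaced by sentinel tokens in the input, and the target lists each sentinel followed by the original span. The sentinel format and helper name here are illustrative.

```python
def mask_spans(tokens, spans, sentinel="<extra_id_{}>"):
    """T5/CodeT5-style span corruption. `spans` is a list of (start, end)
    half-open index ranges to mask, in order and non-overlapping."""
    source, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sid = sentinel.format(i)
        source += tokens[cursor:start] + [sid]   # span replaced by sentinel
        target += [sid] + tokens[start:end]      # sentinel + original span
        cursor = end
    source += tokens[cursor:]
    return source, target

tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
src, tgt = mask_spans(tokens, [(1, 2), (8, 9)])
```

During pretraining, span positions and lengths are sampled randomly (e.g., ~15% corruption rate); the model learns to reconstruct the target from the corrupted source.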
---

## 📊 Evaluation Benchmarks

| Benchmark  | Metric       |
|------------|--------------|
| HumanEval  | Pass@1, BLEU |
| MBPP       | Accuracy     |
| CodeXGLUE  | CodeBLEU, EM |
| Unit tests | Pass rate    |

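Pass@1 in the table is typically computed with the unbiased pass@k estimator introduced alongside HumanEval: draw n samples per problem, count the c that pass the unit tests, and average 1 − C(n−c, k)/C(n, k) over problems. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n samples were drawn and c of them passed the tests."""
    if n - c < k:
        return 1.0  # too few failures to fill a sample of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 1 correct sample out of 4, pass@1 is 1/4 for this problem.
p = pass_at_k(n=4, c=1, k=1)
```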
---

## 🧪 Project Roadmap

### ✅ Phase 1: MVP Model
- Train a TinyCodeT5 model with span masking
- Evaluate on MBPP and HumanEval-lite
- Serve via Ollama + a LangChain prompt chain

### 🔁 Phase 2: Structural Learning
- Add AST/CFG encodings
- Introduce grammar-constrained decoding
- Multi-task training (generation, summarization, repair)

### 📦 Phase 3: Optimization & Packaging
- Distill from a larger model (e.g., StarCoder)
- Add reinforcement fine-tuning driven by test cases
- Export to Hugging Face + Ollama integration

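The distillation step in Phase 3 usually minimizes a KL term between temperature-softened teacher and student distributions (Hinton-style knowledge distillation). A generic sketch of that loss, not the project's exact recipe:

```python
from math import exp, log

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    zs = [z / temperature for z in logits]
    m = max(zs)
    exps = [exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions, scaled by T^2
    so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

loss_same = kd_loss([1.0, 2.0, 0.5], [1.0, 2.0, 0.5])  # identical -> zero
loss_diff = kd_loss([2.0, 1.0, 0.5], [1.0, 2.0, 0.5])  # mismatch -> positive
```

In practice this term is combined with the ordinary cross-entropy loss on ground-truth code, weighted by a mixing coefficient.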
---

## 🛠️ Tools & Frameworks

- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [LangChain](https://github.com/langchain-ai/langchain)
- [Ollama](https://ollama.com/)
- SentencePiece / BPE tokenization
- NetworkX for AST/CFG graph handling

---

## 🤝 Contributing

Want to help with grammar decoders, AST integration, or evaluation? PRs are welcome!

---

## 📜 License

MIT License. Built for research and open experimentation.

---

## 📧 Contact

Open an issue or discussion on GitHub!