rtt4fb committed · Commit 9b29568 · verified · Parent: b6c2975

Update README.md

Files changed (1): README.md (+124 -142)
---
library_name: transformers
license: apache-2.0
datasets:
- cais/mmlu
- maveriq/bigbenchhard
- CM/codexglue_code2text_python
- MatrixStudio/Codeforces-Python-Submissions
language:
- en
metrics:
- bertscore
- bleu
- accuracy
- exact_match
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---
# Model Card for rtt4fb/LlamaCode-Codeforces-v1
### Developed by Taylor Tucker at the University of Virginia School of Data Science

## Introduction
Using large language models (LLMs) to generate practice questions could go a long way towards improving educational opportunities by enabling flexible, individualized generation of practice problems, which can greatly increase student success [6]. It would also make the lives of computer science educators easier by reducing the time-consuming process of creating material for students. The problem with current LLM capability in this regard is that the problems the models generate are barebones, boilerplate, and boring. I set out to train an LLM to generate interesting and applicable problems with which students may practice their skills without being dissuaded by the drabness of the material. This model may also be used to match student learning rates, allowing for more individualized instruction.

## Training Data
To train the model, I utilized a [vast repository generated by MatrixStudio](https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions) of data from Codeforces, a Kyrgyzstan-based online programming competition [4]. The dataset consisted of 690,396 problems used in the competition, along with the topic tags describing each problem. After removing duplicate questions to avoid leakage between training and testing, I was left with 16,533 examples, which were immediately split 80% / 20% into train and test sets. The testing split was stored and not used for training. A minimal sketch of this preparation step is shown below.
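
This sketch assumes the problem statement lives in a column named `problem`; the actual column name in the MatrixStudio dataset may differ, and the split seed is arbitrary.

```python
# Illustrative preparation sketch: deduplicate problems, then hold out 20% for testing.
# The column name "problem" is an assumption, not the confirmed dataset schema.
from datasets import load_dataset

raw = load_dataset("MatrixStudio/Codeforces-Python-Submissions", split="train")

# Keep only the first occurrence of each problem statement so the same
# question cannot leak into both the train and test splits.
seen = set()
unique_indices = []
for i, example in enumerate(raw):
    statement = example["problem"]
    if statement not in seen:
        seen.add(statement)
        unique_indices.append(i)

unique = raw.select(unique_indices)

# 80% / 20% train-test split; the test split is stored and never used for training.
splits = unique.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```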

The problems in the Codeforces dataset are interesting, unique, and cover a vast range of subject matter. The original dataset was transformed using a baseline prompt (i.e., "Please generate 3 Python programming question(s) involving the following subject areas:") followed by each problem's associated tags; the target response is the question itself. I used three-shot prompting with questions sharing the same tags to maximize the model's learning, as sketched below.
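
As a rough illustration of that transformation (not the exact script used), a single training prompt can be assembled from a problem's tags plus three previously seen questions sharing those tags; `tags` and the example-question list are hypothetical inputs here.

```python
import random

def build_prompt(tags, questions_with_same_tags):
    """Assemble the baseline instruction plus three in-context example questions.

    `tags` is a list of topic tags (e.g. ["greedy", "implementation"]);
    `questions_with_same_tags` holds question statements that share those tags.
    """
    header = (
        "Please generate 3 Python programming question(s) involving the "
        f"following subject areas: {', '.join(tags)}. For example:\n"
    )
    shots = random.sample(questions_with_same_tags, k=3)
    body = "\n\n".join(f"{i + 1}) {q}" for i, q in enumerate(shots))
    # The target completion paired with this prompt is the held-out question itself.
    return header + "\n" + body
```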

## Training Method
Given the complex nature of the problems in the Codeforces repository, and without the computational capacity to perform a full fine-tune, I opted for parameter-efficient fine-tuning with LoRA due to its ease of implementation, computational efficiency, and proven track record [8]. LoRA is a widely used, parameter-efficient adapter method which has been shown to work well for fine-tuning across use cases. It works by introducing low-rank adapter matrices into the layers of the LLM and training only those matrices, leaving the pre-trained weights of the base model unchanged. The LoRA hyperparameters used in this project are as follows:
- `LORA_R = 64`
- `LORA_ALPHA = 64`
- `LORA_DROPOUT = 0.05`

The base model in this experiment was Meta's Llama 3.2 1-billion-parameter Instruct model with its respective tokenizer [15]. The adapted model was trained on the training split of the Codeforces dataset over multiple days using two Nvidia A6000 GPUs.
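
Using the `peft` library, the adapter setup described above can be sketched as follows. Only the rank, alpha, and dropout values come from this card; the target modules and dtype are assumptions.

```python
# Sketch of the LoRA configuration with the hyperparameters listed above.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,   # assumed precision
    device_map="auto",
)

lora_config = LoraConfig(
    r=64,                # LORA_R
    lora_alpha=64,       # LORA_ALPHA
    lora_dropout=0.05,   # LORA_DROPOUT
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not stated in the card
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```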

## Evaluation
This model was evaluated on three benchmarks (BigBenchHard, MMLU, and CodeXGlue) and on the testing set of the custom dataset derived from Codeforces [3], [4], [5], [10]. BigBenchHard was chosen to ensure that the model maintained general reasoning capabilities. CodeXGlue was used to verify that the model retained programming understanding, even though the aim of this fine-tuning process was to improve programming problem generation. MMLU was chosen to benchmark the model's ability to retain general knowledge post-training and to indicate potential catastrophic forgetting if it occurs.

The results of LlamaCode-Codeforces-v1 were compared against Microsoft's Phi-4 Mini Instruct model and the baseline Llama 3.2 1B Instruct model [11], [15]. These models were chosen because their parameter counts (3.8B and 1B, respectively) are close to that of LlamaCode-Codeforces-v1, and both are used for reasoning tasks.

| **Benchmark** | **BBH** | **CodeXGlue** | **MMLU** | **Codeforces Test** |
|------------------------:|:-------------:|:---------------:|:----------:|:--------------------:|
| _Metric_ | _Exact Match_ | _Smoothed BLEU_ | _Accuracy_ | _Mean BERT F1 Score_ |
| **Model** | | | | |
| LlamaCode-Codeforces-v1 | 0.0000 | 1.0346 | 0.3498 | **0.8019** |
| Llama 3.2 1B Instruct | 0.0000 | 1.0209 | 0.4666 | 0.8010 |
| Phi 4 Mini Instruct | 0.0000 | **1.0506** | **0.6846** | 0.6753 |

As shown in the table above, the fine-tuned LlamaCode-Codeforces-v1 model outperforms both baseline models on the custom Codeforces test benchmark, with a mean BERT F1 score of 0.8019. The fine-tuned model also outperforms the base Llama 3.2 1B Instruct on the CodeXGlue programming understanding benchmark, though it falls short of Phi-4 there. The fine-tuned model scores lower on MMLU accuracy than the baselines but still retains some general-knowledge capability; both Llama 3.2 1B and the fine-tuned model trail Phi-4 on MMLU by a large margin.
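
For reference, the Codeforces Test column reports a mean BERTScore F1 over the held-out questions. A minimal sketch of how such a score can be computed with the `evaluate` library is below; the prediction and reference lists are placeholders, not the actual evaluation harness.

```python
# Compute mean BERTScore F1 between model generations and held-out reference questions.
import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["Generated programming question ..."]          # model outputs (placeholder)
references = ["Held-out Codeforces question statement ..."]   # test-split targets (placeholder)

results = bertscore.compute(predictions=predictions, references=references, lang="en")
mean_f1 = sum(results["f1"]) / len(results["f1"])
print(f"Mean BERT F1: {mean_f1:.4f}")
```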

## Usage & Intended Uses

### Usage
Please ensure you have the PyTorch, HuggingFace Hub, and Transformers libraries installed:

```bash
pip install torch huggingface_hub transformers
```

To load the model as is, run:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("rtt4fb/LlamaCode-Codeforces-v1", device_map="auto", torch_dtype=torch.bfloat16)
```

To use the model in a pipeline paradigm, run:

```python
import torch
from transformers import pipeline

model_id = "rtt4fb/LlamaCode-Codeforces-v1"

pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto"
)

prompt = "Write two programming problems about: Strings."

outputs = pipe(
    prompt,
    max_new_tokens=512
)

print(outputs)
```

### Intended Uses
The intended use of this model is the experimental generation of programming practice problems. It is not intended at this time for use in an educational setting, and further research may be needed to determine the effectiveness and safety of this model for problem generation.

This model is also intended for computer science educators to use as a brainstorming mechanism for problem generation, whether for the classroom experience, assignments, or assessments. It can be used in conjunction with textbooks, online resources, or other educational materials to enhance the quality of the programming problems provided to students. Instructors must perform quality control over the outputs of this model to ensure that appropriateness, safety, quality, and applicability are maintained.

## Prompt Format

An example training prompt is below, derived from the training data:

```
Please generate 3 Python programming question(s) involving the following subject areas: greedy, implementation. For example:

1) Ivan has got an array of *n* non-negative integers *a*1,<=*a*2,<=...,<=*a**n*. Ivan knows that the array is sorted in the non-decreasing order.

Ivan wrote out integers 2*a*1,<=2*a*2,<=...,<=2*a**n* on a piece of paper. Now he wonders, what minimum number of integers of form 2*b* (*b*<=≥<=0) need to be added to the piece of paper so that the sum of all integers written on the paper equalled 2*v*<=-<=1 for some integer *v* (*v*<=≥<=0).

Help Ivan, find the required quantity of numbers.

2) Permutation *p* is an ordered set of integers *p*1,<=<=*p*2,<=<=...,<=<=*p**n*, consisting of *n* distinct positive integers, each of them doesn't exceed *n*. We'll denote the *i*-th element of permutation *p* as *p**i*. We'll call number *n* the size or the length of permutation *p*1,<=<=*p*2,<=<=...,<=<=*p**n*.

The decreasing coefficient of permutation *p*1,<=*p*2,<=...,<=*p**n* is the number of such *i* (1<=≤<=*i*<=&lt;<=*n*), that *p**i*<=&gt;<=*p**i*<=+<=1.

You have numbers *n* and *k*. Your task is to print the permutation of length *n* with decreasing coefficient *k*.

3) In Chelyabinsk lives a much respected businessman Nikita with a strange nickname "Boss". Once Nikita decided to go with his friend Alex to the Summer Biathlon World Cup. Nikita, as a very important person, received a token which allows to place bets on each section no more than on one competitor.

To begin with friends learned the rules: in the race there are *n* sections of equal length and *m* participants. The participants numbered from 1 to *m*. About each participant the following is known:
- *l**i* — the number of the starting section, - *r**i* — the number of the finishing section (*l**i*<=≤<=*r**i*),- *t**i* — the time a biathlete needs to complete an section of the path,- *c**i* — the profit in roubles. If the *i*-th sportsman wins on one of the sections, the profit will be given to the man who had placed a bet on that sportsman.
The *i*-th biathlete passes the sections from *l**i* to *r**i* inclusive. The competitor runs the whole way in (*r**i*<=-<=*l**i*<=+<=1)·*t**i* time units. It takes him exactly *t**i* time units to pass each section. In case of the athlete's victory on *k* sections the man who has betted on him receives *k*·*c**i* roubles.

In each section the winner is determined independently as follows: if there is at least one biathlete running this in this section, then among all of them the winner is the one who has ran this section in minimum time (spent minimum time passing this section). In case of equality of times the athlete with the smaller index number wins. If there are no participants in this section, then the winner in this section in not determined. We have to say that in the summer biathlon all the participants are moving at a constant speed.

We should also add that Nikita can bet on each section and on any contestant running in this section.

Help the friends find the maximum possible profit.
```

However, a more reasonable, single-shot prompt may be:

```
Please generate 3 Python programming question(s) involving the following subject areas: greedy, implementation.
```
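
Assuming the `pipe` text-generation pipeline defined in the Usage section above, that single-shot prompt can be passed directly:

```python
# Reuses the `pipe` object from the Usage section.
prompt = (
    "Please generate 3 Python programming question(s) involving the "
    "following subject areas: greedy, implementation."
)
outputs = pipe(prompt, max_new_tokens=512)
print(outputs[0]["generated_text"])
```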

## Expected Output

For the given example training data prompt above, the expected response is a Python programming question written in English, as seen below:

```
Devu is a renowned classical singer. He is invited to many big functions/festivals. Recently he was invited to "All World Classical Singing Festival". Other than Devu, comedian Churu was also invited.

Devu has provided organizers a list of the songs and required time for singing them. He will sing *n* songs, *i**th* song will take *t**i* minutes exactly.

The Comedian, Churu will crack jokes. All his jokes are of 5 minutes exactly.

People have mainly come to listen Devu. But you know that he needs rest of 10 minutes after each song. On the other hand, Churu being a very active person, doesn't need any rest.

You as one of the organizers should make an optimal sсhedule for the event. For some reasons you must follow the conditions:
- The duration of the event must be no more than *d* minutes; - Devu must complete all his songs; - With satisfying the two previous conditions the number of jokes cracked by Churu should be as many as possible.
If it is not possible to find a way to conduct all the songs of the Devu, output -1. Otherwise find out maximum number of jokes that Churu can crack in the grand event.
```

## Limitations
One limitation of this model is its size. As a 1-billion-parameter model, it lacks the full capabilities of a larger language model. In particular, given the drop on the MMLU benchmark after fine-tuning, this model may perform worse on general-knowledge prompts than other models. Further, due to the uncommon subject matter of the fine-tuning data, the model may produce inconsistent outputs (e.g., swapping pieces of different questions). This can lead to programming questions that make little sense, which is why educators and researchers are encouraged to manually examine model responses for clarity.

Due to the nature of large language models, this model may output inappropriate or dangerous responses. The testing of this model cannot cover all of the potential risks of this technology. We strongly recommend that model outputs be carefully vetted for appropriateness and safety.

## References
[1] R. Xie, C. Huang, J. Wang, and B. Dhingra, “Adversarial Math Word Problem Generation,” Jun. 15, 2024, arXiv: arXiv:2402.17916. doi: 10.48550/arXiv.2402.17916.\
[2] C. Si, D. Yang, and T. Hashimoto, “Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers,” Sep. 06, 2024, arXiv: arXiv:2409.04109. doi: 10.48550/arXiv.2409.04109.\
[3] M. Suzgun et al., “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them,” Oct. 17, 2022, arXiv: arXiv:2210.09261. doi: 10.48550/arXiv.2210.09261.\
[4] MatrixStudio, “Codeforces Python Submissions.” HuggingFace. [Online]. Available: https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions\
[5] S. Lu et al., “CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation,” Mar. 16, 2021, arXiv: arXiv:2102.04664. doi: 10.48550/arXiv.2102.04664.\
[6] F. Schargel and J. Smink, Helping Students Graduate: A Strategic Approach to Dropout Prevention, 0 ed. Routledge, 2013. doi: 10.4324/9781315854816.\
[7] gzipChrist, “Leetcode Problem Dataset.” Kaggle. Accessed: Jan. 29, 2025. [Online]. Available: https://www.kaggle.com/datasets/gzipchrist/leetcode-problem-dataset\
[8] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” Oct. 16, 2021, arXiv: arXiv:2106.09685. doi: 10.48550/arXiv.2106.09685.\
[9] D. Hendrycks, S. Basart, S. Kadavath, and M. Mazeika, “Measuring Coding Challenge Competence with APPS.” 2021. [Online]. Available: https://huggingface.co/datasets/codeparrot/apps\
[10] D. Hendrycks et al., “Measuring Massive Multitask Language Understanding,” Jan. 12, 2021, arXiv: arXiv:2009.03300. doi: 10.48550/arXiv.2009.03300.\
[11] M. Abdin et al., “Phi-4 Technical Report,” Dec. 12, 2024, arXiv: arXiv:2412.08905. doi: 10.48550/arXiv.2412.08905.\
[12] J. Austin et al., “Program Synthesis with Large Language Models,” Aug. 16, 2021, arXiv: arXiv:2108.07732. doi: 10.48550/arXiv.2108.07732.\
[13] P. Denny et al., “Prompt Problems: A New Programming Exercise for the Generative AI Era,” in Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, Portland OR USA: ACM, Mar. 2024, pp. 296–302. doi: 10.1145/3626252.3630909.\
[14] X. Kang, Z. Wang, X. Jin, W. Wang, K. Huang, and Q. Wang, “Template-Driven LLM-Paraphrased Framework for Tabular Math Word Problem Generation,” Dec. 20, 2024, arXiv: arXiv:2412.15594. doi: 10.48550/arXiv.2412.15594.\
[15] A. Grattafiori et al., “The Llama 3 Herd of Models,” 2024, arXiv. doi: 10.48550/ARXIV.2407.21783.