rtt4fb committed · Commit 9b29568 · verified · Parent: b6c2975

Update README.md

Files changed (1): README.md (+124 -142)
---
library_name: transformers
license: apache-2.0
datasets:
- cais/mmlu
- maveriq/bigbenchhard
- CM/codexglue_code2text_python
- MatrixStudio/Codeforces-Python-Submissions
language:
- en
metrics:
- bertscore
- bleu
- accuracy
- exact_match
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---
# Model Card for rtt4fb/LlamaCode-Codeforces-v1
### Developed by Taylor Tucker at the University of Virginia School of Data Science

## Introduction
Using large language models (LLMs) to generate practice questions could go a long way towards improving educational opportunities by enabling flexible, individualized generation of practice problems, which can greatly increase student success [6]. It would also make the lives of computer science educators easier by reducing the time-consuming process of creating material for students. The problem with current LLM capability in this regard is that the problems the models generate are barebones, boilerplate, and boring. I set out to train an LLM to generate interesting and applicable problems with which students may practice their skills without being dissuaded by the drabness of the material. This model may also be used to match student learning rates, allowing for more individualized instruction.

## Training Data
To train the model, I utilized a [vast repository generated by MatrixStudio](https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions) of data from Codeforces, a Kyrgyzstan-based online programming competition [4]. The dataset consisted of 690,396 problems used in the competition, along with the topic tags describing each problem. After removing duplicate questions to avoid leakage between training and testing, I was left with 16,533 examples, which were immediately split 80% / 20% into train and test sets. The testing split was stored and not used for training. A minimal sketch of this preparation step is shown below.
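
This sketch assumes the problem statement lives in a column named `problem`; the actual column name in the MatrixStudio dataset may differ, and the split seed is arbitrary.

```python
# Illustrative preparation sketch: deduplicate problems, then hold out 20% for testing.
# The column name "problem" is an assumption, not the confirmed dataset schema.
from datasets import load_dataset

raw = load_dataset("MatrixStudio/Codeforces-Python-Submissions", split="train")

# Keep only the first occurrence of each problem statement so the same
# question cannot leak into both the train and test splits.
seen = set()
unique_indices = []
for i, example in enumerate(raw):
    statement = example["problem"]
    if statement not in seen:
        seen.add(statement)
        unique_indices.append(i)

unique = raw.select(unique_indices)

# 80% / 20% train-test split; the test split is stored and never used for training.
splits = unique.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```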

The problems in the Codeforces dataset are interesting, unique, and cover a vast range of subject matter. The original dataset was transformed using a baseline prompt (i.e., "Please generate 3 Python programming question(s) involving the following subject areas:") followed by each problem's associated tags; the target response is the question itself. I used three-shot prompting with questions sharing the same tags to maximize the model's learning, as sketched below.
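
As a rough illustration of that transformation (not the exact script used), a single training prompt can be assembled from a problem's tags plus three previously seen questions sharing those tags; `tags` and the example-question list are hypothetical inputs here.

```python
import random

def build_prompt(tags, questions_with_same_tags):
    """Assemble the baseline instruction plus three in-context example questions.

    `tags` is a list of topic tags (e.g. ["greedy", "implementation"]);
    `questions_with_same_tags` holds question statements that share those tags.
    """
    header = (
        "Please generate 3 Python programming question(s) involving the "
        f"following subject areas: {', '.join(tags)}. For example:\n"
    )
    shots = random.sample(questions_with_same_tags, k=3)
    body = "\n\n".join(f"{i + 1}) {q}" for i, q in enumerate(shots))
    # The target completion paired with this prompt is the held-out question itself.
    return header + "\n" + body
```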

## Training Method
Given the complex nature of the problems in the Codeforces repository, and without the computational capacity to perform a full fine-tune, I opted for parameter-efficient fine-tuning with LoRA due to its ease of implementation, computational efficiency, and proven track record [8]. LoRA is a widely used, parameter-efficient adapter method which has been shown to work well for fine-tuning across use cases. It works by introducing low-rank adapter matrices into the layers of the LLM and training only those matrices, leaving the pre-trained weights of the base model unchanged. The LoRA hyperparameters used in this project are as follows:
- `LORA_R = 64`
- `LORA_ALPHA = 64`
- `LORA_DROPOUT = 0.05`

The base model in this experiment was Meta's Llama 3.2 1-billion-parameter Instruct model with its respective tokenizer [15]. The adapted model was trained on the training split of the Codeforces dataset over multiple days using two Nvidia A6000 GPUs.
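
Using the `peft` library, the adapter setup described above can be sketched as follows. Only the rank, alpha, and dropout values come from this card; the target modules and dtype are assumptions.

```python
# Sketch of the LoRA configuration with the hyperparameters listed above.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,   # assumed precision
    device_map="auto",
)

lora_config = LoraConfig(
    r=64,                # LORA_R
    lora_alpha=64,       # LORA_ALPHA
    lora_dropout=0.05,   # LORA_DROPOUT
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not stated in the card
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```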

## Evaluation
This model was evaluated on three benchmarks (BigBenchHard, MMLU, and CodeXGlue) and on the testing set of the custom dataset derived from Codeforces [3], [4], [5], [10]. BigBenchHard was chosen to ensure that the model maintained general reasoning capabilities. CodeXGlue was used to verify that the model retained programming understanding, even though the aim of this fine-tuning process was to improve programming problem generation. MMLU was chosen to benchmark the model's ability to retain general knowledge post-training and to indicate potential catastrophic forgetting if it occurs.

The results of LlamaCode-Codeforces-v1 were compared against Microsoft's Phi-4 Mini Instruct model and the baseline Llama 3.2 1B Instruct model [11], [15]. These models were chosen because their parameter counts (3.8B and 1B, respectively) are close to that of LlamaCode-Codeforces-v1, and both are used for reasoning tasks.

| **Benchmark** | **BBH** | **CodeXGlue** | **MMLU** | **Codeforces Test** |
|------------------------:|:-------------:|:---------------:|:----------:|:--------------------:|
| _Metric_ | _Exact Match_ | _Smoothed BLEU_ | _Accuracy_ | _Mean BERT F1 Score_ |
| **Model** | | | | |
| LlamaCode-Codeforces-v1 | 0.0000 | 1.0346 | 0.3498 | **0.8019** |
| Llama 3.2 1B Instruct | 0.0000 | 1.0209 | 0.4666 | 0.8010 |
| Phi 4 Mini Instruct | 0.0000 | **1.0506** | **0.6846** | 0.6753 |

As shown in the table above, the fine-tuned LlamaCode-Codeforces-v1 model outperforms both baseline models on the custom Codeforces test benchmark, with a mean BERT F1 score of 0.8019. The fine-tuned model also outperforms the base Llama 3.2 1B Instruct on the CodeXGlue programming understanding benchmark, though it falls short of Phi-4 there. The fine-tuned model scores lower on MMLU accuracy than the baselines but still retains some general-knowledge capability; both Llama 3.2 1B and the fine-tuned model trail Phi-4 on MMLU by a large margin.
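
For reference, the Codeforces Test column reports a mean BERTScore F1 over the held-out questions. A minimal sketch of how such a score can be computed with the `evaluate` library is below; the prediction and reference lists are placeholders, not the actual evaluation harness.

```python
# Compute mean BERTScore F1 between model generations and held-out reference questions.
import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["Generated programming question ..."]          # model outputs (placeholder)
references = ["Held-out Codeforces question statement ..."]   # test-split targets (placeholder)

results = bertscore.compute(predictions=predictions, references=references, lang="en")
mean_f1 = sum(results["f1"]) / len(results["f1"])
print(f"Mean BERT F1: {mean_f1:.4f}")
```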

## Usage & Intended Uses

### Usage
Please ensure you have the PyTorch, HuggingFace Hub, and Transformers libraries installed:

```bash
pip install torch huggingface_hub transformers
```

To load the model as is, run:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("rtt4fb/LlamaCode-Codeforces-v1", device_map="auto", torch_dtype=torch.bfloat16)
```

To use the model in a pipeline paradigm, run:

```python
import torch
from transformers import pipeline

model_id = "rtt4fb/LlamaCode-Codeforces-v1"

pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto"
)

prompt = "Write two programming problems about: Strings."

outputs = pipe(
    prompt,
    max_new_tokens=512
)

print(outputs)
```

### Intended Uses
The intended use of this model is the experimental generation of programming practice problems. It is not intended at this time for use in an educational setting, and further research may be needed to determine the effectiveness and safety of this model for problem generation.

This model is also intended for computer science educators to use as a brainstorming mechanism for problem generation, whether for the classroom experience, assignments, or assessments. It can be used in conjunction with textbooks, online resources, or other educational materials to enhance the quality of the programming problems provided to students. Instructors must perform quality control over the outputs of this model to ensure that appropriateness, safety, quality, and applicability are maintained.

## Prompt Format

An example training prompt is below, derived from the training data:

```
Please generate 3 Python programming question(s) involving the following subject areas: greedy, implementation. For example:

1) Ivan has got an array of *n* non-negative integers *a*1,<=*a*2,<=...,<=*a**n*. Ivan knows that the array is sorted in the non-decreasing order.

Ivan wrote out integers 2*a*1,<=2*a*2,<=...,<=2*a**n* on a piece of paper. Now he wonders, what minimum number of integers of form 2*b* (*b*<=≥<=0) need to be added to the piece of paper so that the sum of all integers written on the paper equalled 2*v*<=-<=1 for some integer *v* (*v*<=≥<=0).

Help Ivan, find the required quantity of numbers.

2) Permutation *p* is an ordered set of integers *p*1,<=<=*p*2,<=<=...,<=<=*p**n*, consisting of *n* distinct positive integers, each of them doesn't exceed *n*. We'll denote the *i*-th element of permutation *p* as *p**i*. We'll call number *n* the size or the length of permutation *p*1,<=<=*p*2,<=<=...,<=<=*p**n*.

The decreasing coefficient of permutation *p*1,<=*p*2,<=...,<=*p**n* is the number of such *i* (1<=≤<=*i*<=&lt;<=*n*), that *p**i*<=&gt;<=*p**i*<=+<=1.

You have numbers *n* and *k*. Your task is to print the permutation of length *n* with decreasing coefficient *k*.

3) In Chelyabinsk lives a much respected businessman Nikita with a strange nickname "Boss". Once Nikita decided to go with his friend Alex to the Summer Biathlon World Cup. Nikita, as a very important person, received a token which allows to place bets on each section no more than on one competitor.

To begin with friends learned the rules: in the race there are *n* sections of equal length and *m* participants. The participants numbered from 1 to *m*. About each participant the following is known:
- *l**i* — the number of the starting section, - *r**i* — the number of the finishing section (*l**i*<=≤<=*r**i*),- *t**i* — the time a biathlete needs to complete an section of the path,- *c**i* — the profit in roubles. If the *i*-th sportsman wins on one of the sections, the profit will be given to the man who had placed a bet on that sportsman.
The *i*-th biathlete passes the sections from *l**i* to *r**i* inclusive. The competitor runs the whole way in (*r**i*<=-<=*l**i*<=+<=1)·*t**i* time units. It takes him exactly *t**i* time units to pass each section. In case of the athlete's victory on *k* sections the man who has betted on him receives *k*·*c**i* roubles.

In each section the winner is determined independently as follows: if there is at least one biathlete running this in this section, then among all of them the winner is the one who has ran this section in minimum time (spent minimum time passing this section). In case of equality of times the athlete with the smaller index number wins. If there are no participants in this section, then the winner in this section in not determined. We have to say that in the summer biathlon all the participants are moving at a constant speed.

We should also add that Nikita can bet on each section and on any contestant running in this section.

Help the friends find the maximum possible profit.
```

However, a more reasonable, single-shot prompt may be:

```
Please generate 3 Python programming question(s) involving the following subject areas: greedy, implementation.
```
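
Assuming the `pipe` text-generation pipeline defined in the Usage section above, that single-shot prompt can be passed directly:

```python
# Reuses the `pipe` object from the Usage section.
prompt = (
    "Please generate 3 Python programming question(s) involving the "
    "following subject areas: greedy, implementation."
)
outputs = pipe(prompt, max_new_tokens=512)
print(outputs[0]["generated_text"])
```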

## Expected Output

For the given example training data prompt above, the expected response is a Python programming question written in English, as seen below:

```
Devu is a renowned classical singer. He is invited to many big functions/festivals. Recently he was invited to "All World Classical Singing Festival". Other than Devu, comedian Churu was also invited.

Devu has provided organizers a list of the songs and required time for singing them. He will sing *n* songs, *i**th* song will take *t**i* minutes exactly.

The Comedian, Churu will crack jokes. All his jokes are of 5 minutes exactly.

People have mainly come to listen Devu. But you know that he needs rest of 10 minutes after each song. On the other hand, Churu being a very active person, doesn't need any rest.

You as one of the organizers should make an optimal sсhedule for the event. For some reasons you must follow the conditions:
- The duration of the event must be no more than *d* minutes; - Devu must complete all his songs; - With satisfying the two previous conditions the number of jokes cracked by Churu should be as many as possible.
If it is not possible to find a way to conduct all the songs of the Devu, output -1. Otherwise find out maximum number of jokes that Churu can crack in the grand event.
```

## Limitations
One limitation of this model is its size. As a 1-billion-parameter model, it lacks the full capabilities of a larger language model. In particular, given the drop on the MMLU benchmark after fine-tuning, this model may perform worse on general-knowledge prompts than other models. Further, due to the uncommon subject matter of the fine-tuning data, the model may produce inconsistent outputs (e.g., swapping pieces of different questions). This can lead to programming questions that make little sense, which is why educators and researchers are encouraged to manually examine model responses for clarity.

Due to the nature of large language models, this model may output inappropriate or dangerous responses. The testing of this model cannot cover all of the potential risks of this technology. We strongly recommend that model outputs be carefully vetted for appropriateness and safety.

## References
[1] R. Xie, C. Huang, J. Wang, and B. Dhingra, “Adversarial Math Word Problem Generation,” Jun. 15, 2024, arXiv: arXiv:2402.17916. doi: 10.48550/arXiv.2402.17916.\
[2] C. Si, D. Yang, and T. Hashimoto, “Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers,” Sep. 06, 2024, arXiv: arXiv:2409.04109. doi: 10.48550/arXiv.2409.04109.\
[3] M. Suzgun et al., “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them,” Oct. 17, 2022, arXiv: arXiv:2210.09261. doi: 10.48550/arXiv.2210.09261.\
[4] MatrixStudio, “Codeforces Python Submissions.” HuggingFace. [Online]. Available: https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions\
[5] S. Lu et al., “CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation,” Mar. 16, 2021, arXiv: arXiv:2102.04664. doi: 10.48550/arXiv.2102.04664.\
[6] F. Schargel and J. Smink, Helping Students Graduate: A Strategic Approach to Dropout Prevention, 0 ed. Routledge, 2013. doi: 10.4324/9781315854816.\
[7] gzipChrist, “Leetcode Problem Dataset.” Kaggle. Accessed: Jan. 29, 2025. [Online]. Available: https://www.kaggle.com/datasets/gzipchrist/leetcode-problem-dataset\
[8] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” Oct. 16, 2021, arXiv: arXiv:2106.09685. doi: 10.48550/arXiv.2106.09685.\
[9] D. Hendrycks, S. Basart, S. Kadavath, and M. Mazeika, “Measuring Coding Challenge Competence with APPS.” 2021. [Online]. Available: https://huggingface.co/datasets/codeparrot/apps\
[10] D. Hendrycks et al., “Measuring Massive Multitask Language Understanding,” Jan. 12, 2021, arXiv: arXiv:2009.03300. doi: 10.48550/arXiv.2009.03300.\
[11] M. Abdin et al., “Phi-4 Technical Report,” Dec. 12, 2024, arXiv: arXiv:2412.08905. doi: 10.48550/arXiv.2412.08905.\
[12] J. Austin et al., “Program Synthesis with Large Language Models,” Aug. 16, 2021, arXiv: arXiv:2108.07732. doi: 10.48550/arXiv.2108.07732.\
[13] P. Denny et al., “Prompt Problems: A New Programming Exercise for the Generative AI Era,” in Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, Portland OR USA: ACM, Mar. 2024, pp. 296–302. doi: 10.1145/3626252.3630909.\
[14] X. Kang, Z. Wang, X. Jin, W. Wang, K. Huang, and Q. Wang, “Template-Driven LLM-Paraphrased Framework for Tabular Math Word Problem Generation,” Dec. 20, 2024, arXiv: arXiv:2412.15594. doi: 10.48550/arXiv.2412.15594.\
[15] A. Grattafiori et al., “The Llama 3 Herd of Models,” 2024, arXiv. doi: 10.48550/ARXIV.2407.21783.