---
library_name: transformers
tags:
- code
datasets:
- Leon-Leee/wizardlm_evol_instruct_v2_196K_backuped
- m-a-p/Code-Feedback
- openbmb/UltraInteract_sft
- ise-uiuc/Magicoder-Evol-Instruct-110K
- flytech/python-codes-25k
metrics:
- code_eval
pipeline_tag: text-generation
license: other
license_name: deepseek
---
## AIGCodeGeek-DS-6.7B
### Introduction
AIGCodeGeek-DS-6.7B is the first released model in our Code-LLM family, with competitive performance on public and private benchmarks.
### Model Details
#### Model Description
- Developed by: [Leon Li](https://huggingface.co/Leon-Leee)
- License: [DeepSeek](https://github.com/deepseek-ai/DeepSeek-Coder/blob/main/LICENSE-MODEL)
- Fine-tuned from [deepseek-ai/deepseek-coder-6.7b-base](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base) with full parameters
### Training data
A mixture of samples from high-quality open-source datasets (see *Acknowledgements*) and our private datasets.
We performed contamination detection against evaluation benchmarks in the same way as Magicoder/BigCode ([find_substrings.py](https://github.com/ise-uiuc/magicoder/blob/main/src/magicoder/decontamination/find_substrings.py)).
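
A minimal sketch of this substring-based check, assuming a list of benchmark prompts to screen against (the function and names below are illustrative, not the actual script):
```python
# Illustrative sketch of substring-based decontamination in the spirit of
# Magicoder's find_substrings.py; names and thresholds here are assumptions.
from typing import Iterable

def is_contaminated(sample: str, benchmark_snippets: Iterable[str], min_len: int = 10) -> bool:
    """Flag a training sample if any benchmark snippet appears verbatim inside it."""
    text = sample.lower()
    for snippet in benchmark_snippets:
        snippet = snippet.strip().lower()
        if len(snippet) >= min_len and snippet in text:
            return True
    return False

# Example: screen instruction-tuning samples against a HumanEval-style prompt.
benchmark_snippets = ["def has_close_elements(numbers, threshold):"]
train_samples = ["Write a function that checks whether any two numbers are closer than a threshold ..."]
clean = [s for s in train_samples if not is_contaminated(s, benchmark_snippets)]
```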
### Evaluation
Results to be added.
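
Until results are posted, here is a minimal sketch of how pass@k could be computed with the `code_eval` metric listed in the metadata; the candidate solution and unit test below are toy examples, not actual benchmark data:
```python
# Illustrative pass@1 computation with the `evaluate` library's code_eval metric.
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # code_eval executes generated code; opt in explicitly

from evaluate import load

code_eval = load("code_eval")
candidates = [["def add(a, b):\n    return a + b"]]  # list of completions per problem
tests = ["assert add(2, 3) == 5"]                    # one test program per problem
pass_at_k, results = code_eval.compute(references=tests, predictions=candidates, k=[1])
print(pass_at_k)  # {'pass@1': 1.0}
```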
### Requirements
It should work with the same requirements as DeepSeek-Coder-6.7B or the following packages:
```
torch>=2.0
tokenizers>=0.14.0
transformers>=4.35.0
accelerate
sympy>=1.12
pebble
timeout-decorator
attrdict
```
### QuickStart
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model in bfloat16 on GPU.
tokenizer = AutoTokenizer.from_pretrained("aigcode/AIGCodeGeek-DS-6.7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("aigcode/AIGCodeGeek-DS-6.7B", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

# Build the prompt with the chat template.
messages = [
    {'role': 'user', 'content': "write a merge sort algorithm in python."}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# tokenizer.eos_token_id is the id of the <|EOT|> token
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
```
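
For interactive use, generation can also be streamed token by token. A minimal sketch reusing `model`, `tokenizer`, and `messages` from the snippet above (the sampling settings are illustrative):
```python
# Illustrative streaming variant of the QuickStart above; sampling settings are assumptions.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id,
)
```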
### Acknowledgements
We gained a lot of knowledge and resources from the open-source community:
- [DeepSeekCoder](https://huggingface.co/deepseek-ai): impressive model series and insightful tech reports
- [WizardCoder](https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder): Evol Instruct and public datasets
- We used a backup copy ([Leon-Leee/wizardlm_evol_instruct_v2_196K_backuped](https://huggingface.co/datasets/Leon-Leee/wizardlm_evol_instruct_v2_196K_backuped)) since the original dataset has been deleted.
- [Magicoder](https://github.com/ise-uiuc/magicoder/): OSS-Instruct, [Magicoder-Evol-Instruct-110K](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K) derived from [theblackcat102/evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1)
- [Eurus](https://github.com/OpenBMB/Eurus): creative datasets for reasoning, [openbmb/UltraInteract_sft](https://huggingface.co/datasets/openbmb/UltraInteract_sft)
- [OpenCodeInterpreter](https://opencodeinterpreter.github.io/): well-designed system and datasets, [m-a-p/Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback)
- [flytech/python-codes-25k](https://huggingface.co/datasets/flytech/python-codes-25k): added for diversity
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): an easy-to-use framework for fine-tuning base models