# Mistral-7B DPO Fine-Tuned Adapter (PEFT)

This repository hosts a PEFT (LoRA) adapter trained with Direct Preference Optimization (DPO) on top of mistralai/Mistral-7B-Instruct-v0.2. The preference dataset was generated with PairRM, a pairwise reward model that ranks candidate responses in close agreement with human preferences.
## Model Details

| Attribute | Value |
|---|---|
| Base Model | Mistral-7B-Instruct-v0.2 |
| Training Method | DPO (Direct Preference Optimization) |
| Adapter Type | PEFT (LoRA) |
| Preference Model | PairRM |
| Frameworks | Hugging Face Transformers + TRL + PEFT |
| Compute | 4 × A800 GPUs |
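The training setup summarized above can be reproduced roughly as follows. This is a minimal, hypothetical sketch using TRL's `DPOTrainer` with a LoRA `peft_config`; the hyperparameters (LoRA rank, `beta`, output path, dataset split name) are illustrative assumptions rather than the values used for this adapter, and exact argument names vary between TRL versions.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# DPO expects records with "prompt", "chosen", and "rejected" fields.
dataset = load_dataset("jasperyeoh2/mistral-dpo-dataset", split="train")

# Illustrative LoRA settings; the actual rank/alpha used for this adapter may differ.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="mistral-dpo-peft", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```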
## Dataset

- Source: GAIR/LIMA
- Generation process:
  - 50 instructions sampled from LIMA
  - Each instruction completed 5 times using the base model
  - Pairwise preferences generated with llm-blender/PairRM (see the sketch after this list)
- Final format: DPO-formatted JSONL

Dataset repository: jasperyeoh2/mistral-dpo-dataset
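The preference-labeling step can be sketched as follows. This assumes the `llm-blender` package and its `Blender.rank()` API; the instruction and completions shown are placeholders, not actual records from the dataset.

```python
import json

import llm_blender

# Load the PairRM ranker used to compare candidate completions.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

def build_dpo_record(instruction, completions):
    """Rank candidate completions with PairRM and keep the best/worst pair."""
    # rank() returns a rank per candidate (lower rank = more preferred).
    ranks = blender.rank([instruction], [completions])[0]
    chosen = completions[ranks.argmin()]
    rejected = completions[ranks.argmax()]
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}

# Placeholder example: 5 completions sampled from the base model for one instruction.
record = build_dpo_record(
    "Explain what a LoRA adapter is.",
    [f"candidate completion {i}" for i in range(5)],
)
with open("dpo_dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```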
## Evaluation

- 10 unseen instructions from the LIMA test split were used for evaluation
- Completions from the base and DPO models were compared side by side (see the sketch below)
- The DPO model produced more polite, clearer, and better-aligned responses
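A comparison along these lines can be reproduced with the sketch below. The instruction shown is a placeholder rather than one of the held-out LIMA prompts, and the decoding settings are illustrative assumptions.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "jasperyeoh2/mistral-dpo-peft"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")

def generate(m, instruction):
    # Format the prompt with the Mistral chat template and decode only the new tokens.
    messages = [{"role": "user", "content": instruction}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(m.device)
    output = m.generate(inputs, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

instruction = "Write a short, polite reply declining a meeting invitation."  # placeholder prompt
base_answer = generate(model, instruction)

# Attach the DPO adapter and answer the same prompt again.
model = PeftModel.from_pretrained(model, adapter)
dpo_answer = generate(model, instruction)

print("BASE:\n", base_answer, "\n\nDPO:\n", dpo_answer)
```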
## Usage (with PEFT)

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "jasperyeoh2/mistral-dpo-peft"

# Load the tokenizer and the fp16 base model, then attach the DPO LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
```
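For inference without the PEFT wrapper, the LoRA weights can optionally be folded into the base model; `merge_and_unload()` is the standard PeftModel method for this with LoRA adapters.

```python
# Optional: merge the LoRA weights into the base model for plain Transformers inference.
model = model.merge_and_unload()
```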