# Mistral-7B DPO Fine-Tuned Adapter (PEFT)

This repository hosts a PEFT (LoRA) adapter trained with Direct Preference Optimization (DPO) on top of mistralai/Mistral-7B-Instruct-v0.2. The preference dataset was generated with PairRM, a pairwise reward model that ranks candidate responses in close agreement with human preferences.
## Model Details

| Attribute | Value |
|---|---|
| Base Model | Mistral-7B-Instruct-v0.2 |
| Training Method | DPO (Direct Preference Optimization) |
| Adapter Type | PEFT (LoRA) |
| Preference Model | PairRM |
| Frameworks | Hugging Face Transformers + TRL + PEFT |
| Compute | 4 × A800 GPUs |
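The training setup summarized above can be reproduced roughly as follows. This is a minimal, hypothetical sketch using TRL's `DPOTrainer` with a LoRA `peft_config`; the hyperparameters (LoRA rank, `beta`, output path, dataset split name) are illustrative assumptions rather than the values used for this adapter, and exact argument names vary between TRL versions.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# DPO expects records with "prompt", "chosen", and "rejected" fields.
dataset = load_dataset("jasperyeoh2/mistral-dpo-dataset", split="train")

# Illustrative LoRA settings; the actual rank/alpha used for this adapter may differ.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="mistral-dpo-peft", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```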
## Dataset

- Source: GAIR/LIMA
- Generation process:
  - 50 instructions sampled from LIMA
  - Each instruction completed 5 times using the base model
  - Pairwise preferences generated with llm-blender/PairRM (see the sketch after this list)
- Final format: DPO-formatted JSONL

Dataset repository: jasperyeoh2/mistral-dpo-dataset
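The preference-labeling step can be sketched as follows. This assumes the `llm-blender` package and its `Blender.rank()` API; the instruction and completions shown are placeholders, not actual records from the dataset.

```python
import json

import llm_blender

# Load the PairRM ranker used to compare candidate completions.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

def build_dpo_record(instruction, completions):
    """Rank candidate completions with PairRM and keep the best/worst pair."""
    # rank() returns a rank per candidate (lower rank = more preferred).
    ranks = blender.rank([instruction], [completions])[0]
    chosen = completions[ranks.argmin()]
    rejected = completions[ranks.argmax()]
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}

# Placeholder example: 5 completions sampled from the base model for one instruction.
record = build_dpo_record(
    "Explain what a LoRA adapter is.",
    [f"candidate completion {i}" for i in range(5)],
)
with open("dpo_dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```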
## Evaluation

- 10 unseen instructions from the LIMA test split were used for evaluation
- Completions from the base and DPO models were compared side by side (see the sketch below)
- The DPO model produced more polite, clearer, and better-aligned responses
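A comparison along these lines can be reproduced with the sketch below. The instruction shown is a placeholder rather than one of the held-out LIMA prompts, and the decoding settings are illustrative assumptions.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "jasperyeoh2/mistral-dpo-peft"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")

def generate(m, instruction):
    # Format the prompt with the Mistral chat template and decode only the new tokens.
    messages = [{"role": "user", "content": instruction}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(m.device)
    output = m.generate(inputs, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

instruction = "Write a short, polite reply declining a meeting invitation."  # placeholder prompt
base_answer = generate(model, instruction)

# Attach the DPO adapter and answer the same prompt again.
model = PeftModel.from_pretrained(model, adapter)
dpo_answer = generate(model, instruction)

print("BASE:\n", base_answer, "\n\nDPO:\n", dpo_answer)
```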
## Usage (with PEFT)

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "jasperyeoh2/mistral-dpo-peft"

# Load the tokenizer and the fp16 base model, then attach the DPO LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
```
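For inference without the PEFT wrapper, the LoRA weights can optionally be folded into the base model; `merge_and_unload()` is the standard PeftModel method for this with LoRA adapters.

```python
# Optional: merge the LoRA weights into the base model for plain Transformers inference.
model = model.merge_and_unload()
```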