🧠 Mistral-7B DPO Fine-Tuned Adapter (PEFT)

This repository hosts a PEFT (LoRA) adapter trained with Direct Preference Optimization (DPO) on top of mistralai/Mistral-7B-Instruct-v0.2. The preference data was generated with PairRM, a pairwise reward model that ranks candidate responses in close agreement with human preferences.


📦 Model Details

| Attribute | Value |
|---|---|
| Base Model | mistralai/Mistral-7B-Instruct-v0.2 |
| Training Method | DPO (Direct Preference Optimization) |
| Adapter Type | PEFT (LoRA) |
| Preference Model | llm-blender/PairRM |
| Frameworks | Hugging Face 🤗 Transformers + TRL + PEFT |
| Compute | 4 × A800 GPUs |
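
The exact training script and hyperparameters are not published in this card, so the snippet below is only a minimal sketch of how a LoRA adapter can be DPO-trained with TRL's `DPOTrainer` on a prompt/chosen/rejected dataset. The LoRA rank, beta, learning rate, batch size, and dataset split name are placeholder assumptions, and keyword arguments have moved between TRL releases (newer versions use a `DPOConfig`), so adjust for your installed version.

```python
# Illustrative sketch only: hyperparameters below are placeholders, not the settings used for this adapter.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default

model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# LoRA adapter config (rank and target modules are assumptions, not the released adapter's values)
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# DPO-formatted dataset with "prompt", "chosen", "rejected" columns (split name assumed)
train_dataset = load_dataset("jasperyeoh2/mistral-dpo-dataset", split="train")

trainer = DPOTrainer(
    model,
    ref_model=None,  # with a PEFT model, TRL uses the frozen base weights as the reference
    args=TrainingArguments(
        output_dir="mistral-dpo-peft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,
        num_train_epochs=1,
    ),
    beta=0.1,  # DPO temperature (placeholder value)
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("mistral-dpo-peft")
```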

📚 Dataset

  • Source: GAIR/LIMA
  • Generation Process (see the sketch below):
    • 50 instructions sampled from LIMA
    • Each instruction was completed 5 times using the base model
    • Pairwise preferences generated using llm-blender/PairRM
  • Final Format: DPO-formatted JSONL

πŸ“ Dataset Repository: jasperyeoh2/mistral-dpo-dataset


🧪 Evaluation

  • 10 unseen instructions from the LIMA test split were used for evaluation
  • Completions from the base model and the DPO-tuned adapter were compared side by side
  • The DPO-tuned model produced more polite, clearer, and better-aligned responses

🚀 Usage (with PEFT)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
adapter = "jasperyeoh2/mistral-dpo-peft"

tokenizer = AutoTokenizer.from_pretrained(base)

# Load the fp16 base model, then attach the DPO-trained LoRA adapter
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
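
A quick generation check, continuing from the loading snippet above; the prompt is a placeholder, and the instruction is formatted with `tokenizer.apply_chat_template`.

```python
# Assumes `model` and `tokenizer` from the loading snippet above (prompt is a placeholder).
messages = [{"role": "user", "content": "Give me three tips for writing clear commit messages."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

# Greedy decoding; strip the prompt tokens before decoding the completion
output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```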