πŸ† Model Card for llm-course-hw2-reward-model

This model is a reward model fine-tuned from HuggingFaceTB/SmolLM-135M-Instruct on the HumanLLMs/Human-Like-DPO-Dataset.
It was trained with TRL to score and rank responses according to human preferences, and serves as the reward signal in RLHF (Reinforcement Learning from Human Feedback) for models such as SmolLM-135M-PPO.


πŸ“ Overview

  • Base Model: SmolLM-135M-Instruct
  • Fine-Tuned Dataset: HumanLLMs/Human-Like-DPO-Dataset
  • Objective: Learn to assign higher scores to more engaging, structured, and emotional responses.
  • Use Case: Used in PPO-based RLHF training to reinforce human-like response quality (see the scoring sketch below).
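
A minimal scoring sketch follows. It assumes the model was saved with a single-logit sequence-classification head (the way TRL reward models are commonly exported); the prompt/response strings are purely illustrative:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Repo id from this card; the single-logit classification head is an assumption
# about how the reward model was saved.
model_id = "tsessk/llm-course-hw2-reward-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=1, torch_dtype=torch.bfloat16
)
model.eval()

# Illustrative prompt/response pair; a higher scalar output means "more preferred".
messages = [
    {"role": "user", "content": "How was your day?"},
    {"role": "assistant", "content": "Honestly, pretty great. I finally finished that book I kept putting off!"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits[0, 0].item()
print(f"reward: {reward:.3f}")
```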

Training Method

  • The model was fine-tuned on pairwise preference comparisons:
    • Each sample contains a chosen response (preferred) and a rejected response.
    • The model learns to assign a higher reward to the chosen response than to the rejected one (see the loss sketch below).
  • The resulting reward model was then used as the reward function during PPO fine-tuning to optimize response generation.
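
Conceptually, the training objective is the Bradley-Terry-style pairwise loss commonly used for reward modeling: push the scalar reward of the chosen response above that of the rejected one. A minimal sketch of that loss (variable names are illustrative, not taken from the actual training script):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective: -log sigmoid(r_chosen - r_rejected).

    The loss shrinks as the chosen response's reward exceeds the rejected one's.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of rewards the model might produce for two preference pairs.
chosen = torch.tensor([1.2, 0.4])
rejected = torch.tensor([0.3, 0.9])
print(pairwise_reward_loss(chosen, rejected))  # roughly 0.66
```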

Model Details

  • Format: Safetensors
  • Model size: 135M params
  • Tensor type: BF16