llm-course-hw2
Collection
LLM course @ HSE and VK LLM
A collection of SmolLM-135M models fine-tuned with DPO, PPO, and Reward Modeling to enhance human-like expressiveness
This model is a fine-tuned reward model based on HuggingFaceTB/SmolLM-135M-Instruct, trained on HumanLLMs/Human-Like-DPO-Dataset.
It was trained with TRL to score and rank responses by human preference, serving as the reward signal in RLHF (Reinforcement Learning from Human Feedback) for models like SmolLM-135M-PPO.
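As a sketch of what "rank responses based on human preferences" means in training terms: TRL's reward modeling optimizes a pairwise Bradley-Terry objective, pushing the score of the chosen response above the score of the rejected one. A minimal, framework-free illustration of that loss (function name is illustrative, not from this repo):

```python
import math

def reward_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise Bradley-Terry loss used in reward modeling:
    loss = -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the chosen response's score exceeds
    the rejected response's score by a larger margin."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Equal scores give loss = log(2); a wider margin gives a smaller loss.
print(reward_pair_loss(0.0, 0.0))  # log(2) ≈ 0.693
print(reward_pair_loss(2.0, 0.0) < reward_pair_loss(0.5, 0.0))  # True
```

During PPO fine-tuning, the trained reward model's scalar score for a generated response then plays the role of the RL reward.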
Base model
HuggingFaceTB/SmolLM-135M