RLHF - a dipta007 Collection

RLHF

updated Mar 19, 2024

Proximal Policy Optimization Algorithms
Paper • 1707.06347 • Published Jul 20, 2017 • 8

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Paper • 2305.18290 • Published May 29, 2023 • 58

Self-Rewarding Language Models
Paper • 2401.10020 • Published Jan 18, 2024 • 148

Training language models to follow instructions with human feedback
Paper • 2203.02155 • Published Mar 4, 2022 • 17

Self-Instruct: Aligning Language Model with Self Generated Instructions
Paper • 2212.10560 • Published Dec 20, 2022 • 9

AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Paper • 2305.14387 • Published May 22, 2023 • 1

ORPO: Monolithic Preference Optimization without Reference Model
Paper • 2403.07691 • Published Mar 12, 2024 • 65