dipta007's Collections
RLHF

Updated Mar 19, 2024

  • Proximal Policy Optimization Algorithms

    Paper • arXiv:1707.06347 • Published Jul 20, 2017 • 8 upvotes

  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Paper • arXiv:2305.18290 • Published May 29, 2023 • 58 upvotes

  • Self-Rewarding Language Models

    Paper • arXiv:2401.10020 • Published Jan 18, 2024 • 148 upvotes

  • Training language models to follow instructions with human feedback

    Paper • arXiv:2203.02155 • Published Mar 4, 2022 • 17 upvotes

  • Self-Instruct: Aligning Language Model with Self Generated Instructions

    Paper • arXiv:2212.10560 • Published Dec 20, 2022 • 9 upvotes

  • AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback

    Paper • arXiv:2305.14387 • Published May 22, 2023 • 1 upvote

  • ORPO: Monolithic Preference Optimization without Reference Model

    Paper • arXiv:2403.07691 • Published Mar 12, 2024 • 65 upvotes