Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL • Paper • 2505.02391 • Published 3 days ago • 21
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training • Paper • 2504.13161 • Published 21 days ago • 88
RLHFlow MATH Process Reward Model • Collection • A collection of datasets and models for process reward modeling. • 15 items • Updated Nov 9, 2024 • 10