Reward Functions
This module provides reward functions intended primarily for use with the GRPOTrainer.
Format rewards
think_format_reward
trl.rewards.think_format_reward(completions: list, **kwargs) → list[float]
Parameters
- completions (list[list[dict[str, str]]]): List of completions to evaluate. Each completion must be a list containing a single message, i.e. a dictionary with the key "content" whose value is the text of the completion.
- **kwargs: Additional keyword arguments. The function does not use them, but they are required in the signature for compatibility with trainers such as GRPOTrainer.
Returns
list[float]
A list of rewards, where each reward is 1.0 if the completion matches the expected format, otherwise 0.0.
Reward function that checks whether the reasoning process is enclosed within "<think>" and "</think>" tags. The function returns a reward of 1.0 if the format is correct, otherwise 0.0.
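The description above does not spell out exactly what counts as a correct format. The following is a minimal sketch under one assumed rule, not TRL's actual implementation: the completion must start with "<think>", contain no second "<think>" opening tag, and close the block with "</think>", with any text allowed after the closing tag. The function name think_format_reward_sketch and the pattern THINK_PATTERN are hypothetical.

```python
import re

# Assumed rule (not necessarily TRL's exact pattern): the text must start with
# "<think>", must not contain a second "<think>", and must close the block with
# "</think>"; anything may follow. DOTALL lets "." span newlines in the reasoning.
THINK_PATTERN = re.compile(r"^<think>(?!.*<think>).*?</think>", re.DOTALL)


def think_format_reward_sketch(completions: list[list[dict[str, str]]], **kwargs) -> list[float]:
    """Hypothetical re-implementation: 1.0 if a completion matches THINK_PATTERN, else 0.0."""
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if THINK_PATTERN.match(content) else 0.0 for content in contents]


completions = [
    [{"content": "<think>\nI add 2 and 2.\n</think>\nThe answer is 4."}],  # well formed
    [{"content": "<think>\nI add 2 and 2.\nThe answer is 4."}],  # missing </think>
    [{"content": "The answer is 4.\n<think>late</think>"}],  # tag not at the start
    [{"content": "<think>a</think><think>b</think>"}],  # second <think> block
]

rewards = think_format_reward_sketch(completions)
print(rewards)  # [1.0, 0.0, 0.0, 0.0]
```

The negative lookahead (?!.*<think>) is what rejects a repeated opening tag; whether the real function is this strict is an assumption worth checking against the TRL source.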