
Foundations

RLVR: RL from Verifiable Rewards

RLVR replaces the trained reward model in RLHF with an automated verifier that checks each output against a ground-truth answer. The result is a clean binary signal that largely sidesteps reward-model exploitation and scales to tasks like maths and code, where answers can be checked programmatically.

advanced · 9 min read · Premium

The central bottleneck in standard RLHF is not the RL algorithm; it is the reward model. A learned reward model is itself a neural network trained on human preferences, which means it is wrong in subtle ways and can be gamed. The moment a policy learns to produce outputs that fool the reward model rather than outputs that are actually good, you have entered the regime Gao et al. (2022) call "reward model overoptimization." Their empirical study found that, as optimisation pressure increases, the proxy reward keeps climbing while the gold reward peaks and then declines: a textbook instance of Goodhart's law.

RLVR sidesteps this by changing the source of the reward signal. Instead of asking "how does the reward model score this output?", you ask "is the final answer correct?" For domains where a ground-truth answer exists and can be checked automatically (maths problems, coding tasks, formal proofs, factual look-ups with known answers), you get a reward signal that is binary, cheap, and far harder to game at the semantic level than a learned proxy. There is no proxy model, no preference labelling, and no drift from the gold standard, provided the verifier itself is sound.
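To make the "is the final answer correct?" check concrete, here is a minimal sketch of a binary verifier in Python. It is an illustration under stated assumptions, not a reference implementation: the `Answer:` marker convention, the normalisation rules, and the function names `extract_final_answer`, `normalise`, and `verifiable_reward` are all hypothetical choices made for this example.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    Assumption: the policy is prompted to end with a line 'Answer: <value>'.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def normalise(answer: str) -> str:
    """Light, illustrative normalisation so '1,024.' and '1024' compare equal."""
    return answer.strip().lower().replace(",", "").rstrip(".")


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the ground truth, else 0.0."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        # No parseable answer counts as incorrect.
        return 0.0
    return 1.0 if normalise(predicted) == normalise(ground_truth) else 0.0


if __name__ == "__main__":
    good = "2^10 is 1024 because we double ten times.\nAnswer: 1,024."
    bad = "I think it is about a thousand.\nAnswer: 1000"
    print(verifiable_reward(good, "1024"))
    print(verifiable_reward(bad, "1024"))
```

The reward depends only on the extracted final answer and the stored ground truth, so there is no learned scorer for the policy to exploit. The extraction and normalisation steps are where a real verifier needs the most care: a check that is too strict penalises correct answers written differently, and one that is too lenient can be gamed.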

What Makes a Reward "Verifiable"
