RLVR: RL from Verifiable Rewards

The central bottleneck in standard RLHF is not the RL algorithm; it is the reward model. A learned reward model is itself a neural network trained on human preferences, which means it is wrong in subtle ways and can be gamed. The moment a policy learns to produce outputs that fool the reward model rather than outputs that are actually good, you have entered the regime Gao et al. (2022) call "reward over-optimisation." Their empirical study found that proxy reward keeps climbing while gold reward plateaus and then declines - a textbook instance of Goodhart's law.

RLVR sidesteps this by switching the source of the reward signal. Instead of asking "how does the reward model score this output?", you ask "is the final answer correct?" For domains where a ground-truth answer exists and can be checked automatically - maths problems, coding tasks, formal proofs, factual look-ups with known answers - you get a reward signal that is binary, cheap, and far harder to game than a learned proxy. A policy can still exploit a sloppy checker (a thin test suite, lenient answer parsing), but it cannot talk its way past a correct one. No proxy model, no preference labels, no drift from the gold standard.

What Makes a Reward "Verifiable"

A verifiable reward has three properties:

Property	Description	Example
Deterministic	Same answer always scores the same	`2 + 2 = 4` is always correct
Automatic	No human in the loop per sample	Regular-expression match against expected output
Ground-truth-anchored	Correctness is objective, not a model's opinion	Final numerical answer on a maths benchmark

This is a strict subset of all possible reward signals. RLVR applies well to maths, competitive coding, SQL generation, unit-tested software, and theorem proving. It does not apply straightforwardly to open-ended generation tasks like summarisation or dialogue, where "correct" is not well-defined - those still need a learned reward model or human preference data.

The verification function v(y, y*) is usually simple: string normalisation followed by exact match, or executing code against a test suite. Lightman et al. (2023) compared this kind of coarse outcome-level signal (did the model get the final answer right?) with step-level (process) supervision when training reward models for maths, and found process supervision stronger when step labels are available. Outcome-level checking is nonetheless what RLVR relies on, because it needs no per-step labels.
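As a rough illustration of the "normalise, then exact match" case, here is a minimal sketch of v(y, y*). The function names `normalise` and `verify` and the specific normalisation rules (lowercasing, dropping commas, stripping a trailing full stop) are assumptions chosen for the example, not a standard; real pipelines tune these rules to their answer format.

```python
import re


def normalise(answer: str) -> str:
    """Canonicalise an answer string before comparison (illustrative rules)."""
    text = answer.strip().lower()
    text = text.replace(",", "")       # assumption: commas are thousands separators
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    return text.rstrip(".")            # drop a trailing full stop


def verify(y: str, y_star: str) -> float:
    """Binary reward: 1.0 if the normalised answers match exactly, else 0.0."""
    return 1.0 if normalise(y) == normalise(y_star) else 0.0


if __name__ == "__main__":
    print(verify(" 1,024. ", "1024"))  # formatting differences are ignored
    print(verify("41", "42"))          # a wrong answer earns no reward
```

The checker is deterministic and needs no model call, which is what keeps the reward cheap. The trade-off is that overly lenient normalisation can itself become something the policy exploits.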

GRPO: The RL Algorithm Behind DeepSeek-R1

The DeepSeek-R1 paper (2025) demonstrated, with its R1-Zero model, that a language model can develop sophisticated chain-of-thought reasoning through pure RL on verifiable rewards, without any supervised fine-tuning on human-labelled reasoning traces. (The released R1 model adds a cold-start fine-tuning stage before RL.) The RL algorithm they used is Group Relative Policy Optimisation (GRPO), first introduced in DeepSeekMath (Shao et al., 2024) as a memory-efficient alternative to PPO.

In standard PPO for language models, you need four models in memory: the policy, the reference policy (for KL regularisation), a critic (value function), and the reward model. GRPO eliminates the critic entirely.
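In place of a critic, the DeepSeekMath formulation samples a group of responses for the same prompt and scores each one relative to the others: the advantage is the response's reward minus the group mean, divided by the group standard deviation. The sketch below shows only that normalisation step. The function name, the use of the population standard deviation, and the `eps` stabiliser are assumptions for illustration, and the clipped policy-gradient update and KL term are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalise each reward against its own sampling group.

    rewards: verifier scores for several responses to ONE prompt.
    eps:     hypothetical stabiliser so an all-equal group gives zeros
             instead of dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one maths problem; two were verified correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [1, -1, -1, 1]
    # If every sample gets the same reward, the group carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Because the baseline comes from sibling samples rather than a learned value network, there is one fewer model to hold in memory. The cost is that several responses must be generated per prompt.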
