Reasoning RL and R1-Style Training

When DeepSeek released R1 in January 2025, the first public shock was the benchmark numbers. The quieter shock, on reading the paper, was the training recipe, seen in its purest form in the R1-Zero variant: no human-annotated reasoning trajectories, no teacher distillation for the RL phase, just a language model generating answers and a rule-based reward signal saying "correct" or "wrong." Self-reflection and long chain-of-thought appeared to emerge from that signal alone. The question this concept addresses is why that works, and what it implies for post-training methodology more broadly.

RLHF Is Already RL

It is worth being precise about vocabulary before the terminology splits into RLHF, RLVR, GRPO, and R1-style training.

Classic RLHF (InstructGPT; Ouyang et al., 2022) is a three-stage pipeline:

1. Supervised fine-tuning (SFT) on high-quality demonstrations.
2. Train a reward model (RM) on pairs of completions ranked by human annotators.
3. Optimise the SFT model against the RM using PPO, subject to a KL penalty anchoring the policy to the SFT reference.

Stage 3 is genuine RL: the policy is the LLM, the action space is token generation, the reward is the RM score, and the KL term is a regulariser that prevents the policy from drifting so far from the reference that the RM is operating out of distribution. The objective is:

J(θ) = E[r_φ(x, y)] - β · KL[π_θ(y|x) || π_ref(y|x)]

where r_φ is the learned reward model, π_θ is the current policy, π_ref is the SFT reference, and β controls how tightly the policy is leashed. A small β lets the policy chase reward aggressively; a large β keeps it close to the SFT distribution.
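To make the role of β concrete, the objective can be estimated from a batch of sampled completions. The following is a minimal sketch, not the InstructGPT implementation: the function name and the toy numbers are illustrative assumptions, and the log-probabilities are assumed to be already summed over each completion's tokens.

```python
from statistics import fmean


def kl_regularised_objective(rewards, logp_policy, logp_ref, beta):
    """Monte Carlo estimate of J(theta) from completions sampled from pi_theta.

    rewards[i]     : r_phi(x, y_i) for sampled completion y_i
    logp_policy[i] : log pi_theta(y_i | x), summed over the tokens of y_i
    logp_ref[i]    : log pi_ref(y_i | x), summed over the same tokens
    Returns (objective_estimate, kl_estimate).
    """
    # For y sampled from pi_theta, the mean log-ratio is an unbiased
    # estimate of KL[pi_theta || pi_ref].
    log_ratios = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    kl_estimate = fmean(log_ratios)
    return fmean(rewards) - beta * kl_estimate, kl_estimate


# Hypothetical batch of four completions for one prompt.
rewards = [1.0, 0.0, 1.0, 1.0]
logp_policy = [-2.0, -3.0, -1.0, -2.0]
logp_ref = [-2.5, -2.5, -2.0, -3.0]

loose, kl = kl_regularised_objective(rewards, logp_policy, logp_ref, beta=0.1)
tight, _ = kl_regularised_objective(rewards, logp_policy, logp_ref, beta=1.0)
print(f"KL estimate {kl:.2f}, J at beta=0.1: {loose:.2f}, J at beta=1.0: {tight:.2f}")
```

The same batch scores lower under the larger β because the identical drift from the reference is charged at a higher price.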

The human feedback is upstream of stage 3: it enters only through the data used to fit the RM. RLHF is therefore a two-part system: a reward learning problem (fitting r_φ) and a policy optimisation problem (maximising J). RLHF has to solve both; RLVR removes the first by dropping the learned RM entirely.
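The reward learning half is usually posed as a pairwise classification problem over the ranked completions. A minimal sketch of the per-pair loss, assuming the Bradley-Terry style objective -log σ(r_chosen - r_rejected); the function name is illustrative, and the scores stand in for RM outputs:

```python
import math


def pairwise_rm_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)) for one human comparison."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the RM scores the preferred completion higher.
for chosen, rejected in [(0.0, 2.0), (0.0, 0.0), (2.0, 0.0)]:
    print(chosen, rejected, round(pairwise_rm_loss(chosen, rejected), 4))
```

Minimising this loss over many comparisons is what fits r_φ; nothing in it touches the policy, which is why the two problems can be discussed separately.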

RLVR: Replacing the Reward Model with a Verifier

Reinforcement Learning with Verifiable Rewards (RLVR) is the term for training regimes where the reward signal is computed, not learned. For mathematics and coding:

A maths answer is either numerically correct or not. The verifier is a string-match or symbolic evaluator.
A code solution either passes the test suite or it doesn't.
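Both verifiers can be sketched in a few lines. This is an illustration under stated assumptions, not any particular lab's grader: the function names, the answer normalisation rules, and the convention that a solution defines a function called solve are all hypothetical. The code verifier uses a bare exec, which is not sandboxed and does not guard against infinite loops; a real pipeline would run candidates in an isolated process with a timeout.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if both strings denote the same rational number, else 0.0."""

    def parse(text):
        # Assumed normalisation: drop thousands separators and a trailing period.
        cleaned = text.strip().replace(",", "").rstrip(".")
        try:
            return Fraction(cleaned)  # accepts "42", "0.5", "1/2", "1e3"
        except (ValueError, ZeroDivisionError):
            return None

    got, want = parse(model_answer), parse(reference)
    return 1.0 if got is not None and got == want else 0.0


def code_reward(candidate_source: str, test_cases, func_name: str = "solve") -> float:
    """Return 1.0 if the candidate passes every (args, expected) case, else 0.0."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # NOT sandboxed: illustration only
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing function, and runtime errors all score zero.
        return 0.0
    return 1.0 if passed else 0.0


tests = [((1, 2), 3), ((0, 0), 0)]
print(math_reward("1/2", "0.5"))
print(code_reward("def solve(a, b):\n    return a + b", tests))
```

Note that the reward is only as strict as the checker: the two test cases above would also accept many incorrect implementations of addition.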

Because the reward is ground-truth-verifiable, r_φ collapses to a deterministic function, and there is no learned reward model left to over-optimise. A KL penalty is typically still present to keep the policy from drifting to degenerate token sequences, but the failure mode of hacking a learned RM is eliminated. What can still be exploited is the verifier itself: a lenient string match or an incomplete test suite rewards answers that are not actually correct.

The practical implication is stark: you do not need a large, carefully trained reward model. You need a dataset of problems with verifiable answers, a reference policy, and a KL budget.
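Those three ingredients are enough to run a toy version of the whole loop. The sketch below is a deliberately tiny stand-in, not R1's training code: the "policy" is a softmax over four candidate answers to a single problem, the verifier is exact string match, and the update uses the exact expected policy gradient instead of sampled rollouts so that the result is deterministic. All names and numbers are illustrative assumptions. For this setting the maximiser of J is known in closed form, π*(y) ∝ π_ref(y) · exp(r(y)/β), which gives something to check the loop against.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_toy_rlvr(candidates, correct, ref_probs, beta, steps=2000, lr=1.0):
    """Gradient ascent on E[r] - beta * KL[pi || pi_ref] for a categorical policy."""
    rewards = [1.0 if c == correct else 0.0 for c in candidates]  # the verifier
    logits = [math.log(q) for q in ref_probs]  # initialise at the reference
    for _ in range(steps):
        probs = softmax(logits)
        # KL-penalised reward per answer: f_i = r_i - beta * log(pi_i / ref_i)
        f = [r - beta * math.log(p / q) for r, p, q in zip(rewards, probs, ref_probs)]
        objective = sum(p * fi for p, fi in zip(probs, f))
        # Exact gradient of the objective w.r.t. logit j is p_j * (f_j - J).
        logits = [z + lr * p * (fi - objective) for z, p, fi in zip(logits, probs, f)]
    return softmax(logits)


def closed_form_optimum(candidates, correct, ref_probs, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [q * math.exp((1.0 if c == correct else 0.0) / beta)
               for c, q in zip(candidates, ref_probs)]
    total = sum(weights)
    return [w / total for w in weights]


candidates = ["12", "14", "16", "18"]
ref_probs = [0.4, 0.3, 0.2, 0.1]  # the reference rarely gives the right answer
for beta in (0.5, 2.0):
    trained = train_toy_rlvr(candidates, "18", ref_probs, beta)
    target = closed_form_optimum(candidates, "18", ref_probs, beta)
    print(beta, [round(p, 4) for p in trained], [round(p, 4) for p in target])
```

The smaller β moves more probability onto the verified answer, while the larger β leaves the policy close to the reference. The closed form also shows why the reference matters: an answer the reference assigns near-zero probability stays unlikely unless β is small.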
