The Infrastructure of LLM RL

Training GPT-3 on next-token prediction cost an estimated $4.6 million in compute. InstructGPT, its RLHF-tuned successor, produced outputs that human labelers preferred over GPT-3's even at 1.3B parameters, more than 100 times smaller than the 175B model. The gap between the two is not scale; it is the post-training stack. Understanding why requires looking at what RL for language models actually keeps in memory.

The Four-Model Problem

Standard deep RL trains one network. RLHF trains four simultaneously:

Component	Role	Frozen?
Policy model (actor)	Generates tokens; weights being updated	No
Reference model (SFT)	Baseline for KL penalty	Yes
Reward model (RM)	Scores completed sequences	Yes (usually)
Value model (critic)	Estimates future reward per token	No

In a naive setup, all four models must be resident in memory to compute a single PPO gradient update, because each one runs a forward pass over the same sequences. For a 7B-parameter policy with auxiliary models of the same size, the weights alone take roughly 4x the VRAM of inference, before gradients and optimizer states for the two trainable models. This is the central infrastructure constraint that drives every engineering decision downstream.
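The arithmetic behind that estimate can be sketched in a few lines. The following is a back-of-the-envelope calculation, not a profiler: it assumes all four models have 7B parameters, that weights are stored in 16 bits, and that the two trainable models use mixed-precision Adam (16-bit gradients plus 32-bit master weights, momentum, and variance). Activations, KV caches, and framework overhead are ignored, and the names `ModelSpec`, `weights_gib`, and `training_state_gib` are illustrative.

```python
from dataclasses import dataclass

GIB = 2**30  # bytes per GiB


@dataclass
class ModelSpec:
    name: str
    n_params: float
    trainable: bool


def weights_gib(spec: ModelSpec, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (assumption: 16-bit storage)."""
    return spec.n_params * bytes_per_param / GIB


def training_state_gib(spec: ModelSpec, extra_bytes_per_param: int = 14) -> float:
    """Extra memory for a trainable model.

    Assumption: mixed-precision Adam, i.e. 2 bytes of 16-bit gradients plus
    12 bytes of 32-bit master weights, momentum, and variance per parameter.
    Frozen models contribute nothing here.
    """
    if not spec.trainable:
        return 0.0
    return spec.n_params * extra_bytes_per_param / GIB


# Assumption: every model is the same size as the 7B policy.
models = [
    ModelSpec("policy", 7e9, trainable=True),
    ModelSpec("reference", 7e9, trainable=False),
    ModelSpec("reward", 7e9, trainable=False),
    ModelSpec("value", 7e9, trainable=True),
]

total_weights_gib = sum(weights_gib(m) for m in models)
total_training_gib = sum(training_state_gib(m) for m in models)

print(f"weights only:        {total_weights_gib:.1f} GiB")
print(f"grads + optimizer:   {total_training_gib:.1f} GiB")
```

Under these assumptions the weights come to about 52 GiB and the training state for the policy and critic adds roughly another 183 GiB, which is why the frozen models are often the first candidates for offloading or sharing.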

The policy and reference model are typically initialised from the same SFT checkpoint. The reward model is trained separately on human preference data: annotators rank completions, and the model is fit with a Bradley-Terry objective so that its scalar scores reproduce those rankings. The value model is often initialised from the reward model's backbone with a new regression head.
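To make the Bradley-Terry step concrete, here is a minimal sketch of the pairwise objective in plain Python. It assumes the reward model has already produced a scalar score for each completion; a ranking of K completions is expanded into every (better, worse) pair, and each pair contributes the loss -log sigmoid(r_better - r_worse). The function names are illustrative and the scores below are made up.

```python
import math
from itertools import combinations


def bradley_terry_prob(r_chosen: float, r_rejected: float) -> float:
    """P(chosen beats rejected) under Bradley-Terry: sigmoid of the score gap."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected), written as a stable softplus."""
    d = r_chosen - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def ranking_to_pairs(ranked_best_first: list) -> list:
    """Expand a ranking into all (better, worse) pairs, preserving order."""
    return list(combinations(ranked_best_first, 2))


def ranking_loss(scores_best_first: list) -> float:
    """Mean pairwise loss over every pair implied by one annotator ranking."""
    pairs = ranking_to_pairs(scores_best_first)
    return sum(pairwise_loss(better, worse) for better, worse in pairs) / len(pairs)


# Hypothetical reward-model scores for three completions, best-ranked first.
agreeing = ranking_loss([2.0, 0.5, -1.0])      # scores agree with the ranking
disagreeing = ranking_loss([-1.0, 0.5, 2.0])   # scores contradict the ranking
print(agreeing < disagreeing)
```

The loss is small when the scores order the completions the way the annotator did and grows when they disagree, which is the signal used to train the reward head.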

The KL-Regularised Objective

Raw RL against a reward model will quickly degenerate: the policy discovers ways to maximise the RM's score without producing text that is actually useful. To prevent this, the optimisation target is not bare reward but a penalised version:

J(θ) = E[r(x, y)] - β · KL[π_θ(y|x) || π_ref(y|x)]

where:
- π_θ is the current policy
- π_ref is the frozen reference (SFT) model
- β controls how far the policy is allowed to drift
- r(x, y) is the reward model's scalar for response y given prompt x

The KL term penalises the policy for assigning meaningfully different probabilities to tokens than the reference model would. At β = 0 the model is free to exploit the reward function without bound. At large β the model barely moves from the SFT baseline. In practice, InstructGPT used β = 0.02, though this is tuned per run and is sensitive to reward model quality.

The KL divergence is estimated token by token and summed over the full sequence. During a PPO rollout, the log-probability ratio log(π_θ / π_ref) = log π_θ - log π_ref is computed for every generated token, making the reference model's forward pass a mandatory cost on every training step.
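The per-token bookkeeping can be sketched as follows. This assumes a common convention in PPO-based RLHF implementations rather than any specific codebase: each generated token receives a reward of -β times its sampled log-ratio, and the reward model's scalar is added at the final token, so the sequence total equals r(x, y) minus β times the summed log-ratios. The log-probabilities below are made-up values and the function names are illustrative.

```python
def per_token_log_ratio(policy_logprobs: list, ref_logprobs: list) -> list:
    """Sampled estimate of the per-token KL: log pi_theta - log pi_ref.

    Individual entries can be negative; only the expectation is a true KL.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference must score the same tokens")
    return [lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs)]


def shaped_rewards(policy_logprobs: list, ref_logprobs: list,
                   rm_score: float, beta: float) -> list:
    """Per-token rewards: KL penalty everywhere, RM score on the last token."""
    ratios = per_token_log_ratio(policy_logprobs, ref_logprobs)
    if not ratios:
        raise ValueError("empty completion")
    rewards = [-beta * r for r in ratios]
    rewards[-1] += rm_score  # assumption: the RM scores the completed sequence once
    return rewards


# Made-up log-probabilities for a four-token completion.
policy_lp = [-0.5, -1.2, -0.3, -2.0]
ref_lp = [-0.7, -1.0, -0.9, -2.5]

rewards = shaped_rewards(policy_lp, ref_lp, rm_score=1.5, beta=0.02)
print(rewards)
```

Summing `rewards` recovers the penalised objective for this one sample, which is why the reference model's log-probabilities have to be available for every token the policy generates.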

RLVR: When You Have a Verifiable Signal

Human preference labels are expensive, inconsistent, and slow. For tasks with deterministic correct answers - mathematics, code execution, formal proofs - you can replace the RM entirely with a rule-based verifier. This is called Reinforcement Learning with Verifiable Rewards (RLVR).
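A rule-based verifier can be very small. The sketch below is one hypothetical design for arithmetic answers: it assumes the completion ends with a line of the form `Answer: <number>` (a convention invented here for illustration), parses the number exactly, and returns a reward of 1.0 or 0.0. Real verifiers for code or proofs would instead run tests or a proof checker.

```python
import re
from fractions import Fraction

# Assumption: the completion ends with "Answer: <integer, decimal, or a/b>".
ANSWER_RE = re.compile(r"Answer:\s*(-?\d+(?:\.\d+|/\d+)?)\s*$")


def verify(completion: str, expected: str) -> float:
    """Return 1.0 if the final answer equals the expected value, else 0.0."""
    match = ANSWER_RE.search(completion.strip())
    if match is None:
        return 0.0  # no parseable final answer: no reward
    try:
        # Fraction compares exactly, so "0.75" and "3/4" count as equal.
        return 1.0 if Fraction(match.group(1)) == Fraction(expected) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0  # malformed number such as "1/0"


print(verify("Three quarters of one is 0.75.\nAnswer: 0.75", "3/4"))
print(verify("I think it is about two.\nAnswer: 2", "3"))
print(verify("The result is large.", "3"))
```

Because the check is deterministic, the score plays the role of r(x, y) directly and no reward model needs to be held in memory.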
