PPO for RLHF in Practice

OpenAI's InstructGPT demonstrated that a 1.3B-parameter model fine-tuned with RLHF was preferred by human evaluators over the raw 175B-parameter GPT-3. The mechanism behind that jump is not magic: it is Proximal Policy Optimisation (PPO) applied to a learned reward signal. Understanding why PPO is used here rather than a simpler policy-gradient method, and how it fits into the four-model apparatus, is the minimum prerequisite for diagnosing alignment training runs in practice.

The four-model apparatus

RLHF with PPO keeps four distinct models in memory simultaneously. Conflating them is a common source of confusion.

Role	Notation	Updated?
Policy (the LM being trained)	\(\pi_\theta\)	Yes, by PPO gradients
Reference policy (frozen SFT checkpoint)	\(\pi_\text{ref}\)	No
Reward model	\(r_\phi\)	No (during RL phase)
Value function (critic)	\(V_\psi\)	Yes, jointly or separately

The policy is the model you care about. The reference policy is its initialisation point, kept frozen so you can measure how far the policy has drifted. The reward model was trained on human preference comparisons; it scores any (prompt, completion) pair with a scalar. The value function estimates expected future reward from any intermediate token position and is required to compute generalised advantage estimates (GAE).
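The role of the value function is easiest to see in the advantage computation itself. The following is a minimal sketch of GAE for a single response, not a reference implementation. The function name `compute_gae`, the defaults `gamma=1.0` and `lam=0.95`, and the convention that the value after the final token is zero are all assumptions made for illustration.

```python
import numpy as np

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalised advantage estimates for one response of T tokens.

    rewards: per-token rewards, shape (T,)
    values:  critic estimates V_psi at each token position, shape (T,)
    Assumption: the value after the last token is 0 (episode ends).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    advantages = np.zeros(T)
    next_value = 0.0   # V after the final token
    running = 0.0      # discounted sum of future TD errors
    for t in reversed(range(T)):
        # One-step TD error: how much better this step went than the critic expected.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    returns = advantages + values  # regression targets for the critic
    return advantages, returns

# Sparse reward: a single scalar arrives on the last token.
adv, ret = compute_gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5], lam=1.0)
```

With `lam=1.0` the estimate reduces to reward-to-go minus the critic's prediction; lowering `lam` leans more heavily on the critic, trading variance for bias.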

All four models are forward-passed on every training step, which is why RLHF is GPU-memory intensive. Techniques like LoRA on the policy (keeping the reference as the frozen backbone) or sharing the backbone between policy and value head reduce cost, but complicate the training loop.
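A rough back-of-the-envelope calculation shows where the memory goes. The sketch below counts only weights, gradients, and optimiser state, and rests on assumptions that are not from the source: 2 bytes per parameter for bf16 weights, and an extra 14 bytes per parameter for each trainable model (2 for gradients, 12 for an fp32 master copy plus two Adam moments). Activations, KV caches, and any sharding are ignored, and the function name is hypothetical.

```python
def rlhf_weight_memory_gb(policy_params, critic_params, reward_params,
                          weight_bytes=2, extra_train_bytes=14):
    """Estimate static memory (in GB, 1e9 bytes) for the four RLHF models.

    Assumptions: frozen models hold weights only; trainable models also
    hold gradients and mixed-precision Adam state.
    """
    def frozen(n):
        return n * weight_bytes

    def trainable(n):
        return n * (weight_bytes + extra_train_bytes)

    sizes = {
        "policy": trainable(policy_params),      # updated by PPO
        "reference": frozen(policy_params),      # frozen SFT copy, same size as policy
        "reward_model": frozen(reward_params),   # frozen during the RL phase
        "critic": trainable(critic_params),      # updated alongside the policy
    }
    gb = {name: n_bytes / 1e9 for name, n_bytes in sizes.items()}
    gb["total"] = sum(gb.values())
    return gb

estimate = rlhf_weight_memory_gb(7e9, 7e9, 7e9)
```

Under these assumptions, three 7B-parameter networks plus the frozen reference come to roughly 252 GB before any activations are stored, and the two trainable models account for most of it. That is why adapter-based training and shared backbones are attractive despite the added complexity.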

The shaped reward and why it matters

The reward signal the policy actually optimises is not \(r_\phi\) alone. It is:

\[R(x, y) = r_\phi(x, y) - \beta \cdot D_\text{KL}\!\left(\pi_\theta(\cdot|x) \,\|\, \pi_\text{ref}(\cdot|x)\right)\]

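where \(x\) is the prompt, \(y\) is the generated response, and \(\beta\) is a tunable coefficient (InstructGPT used \(\beta \approx 0.02\)). In practice the KL term is not computed exactly; it is estimated from the sampled response as the log-ratio \(\log \pi_\theta(y|x) - \log \pi_\text{ref}(y|x)\).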

The KL term penalises the policy for generating token distributions that diverge from the reference. Without it, the policy rapidly exploits the reward model's blind spots: finding short, formulaic, or linguistically bizarre completions that score high on \(r_\phi\) but look nothing like coherent text. This is reward hacking, and it happens within hundreds of gradient steps if \(\beta = 0\).
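One common way to implement the shaped reward is to spread the KL penalty across tokens and add the reward model's score on the final token. The sketch below assumes that layout; it is not necessarily what any particular codebase does. The function name and the per-token log-probability inputs are hypothetical, and the per-token log-ratio is a single-sample estimate of the KL rather than the exact divergence.

```python
import numpy as np

def shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.02):
    """Per-token shaped rewards for one sampled response.

    rm_score:    scalar r_phi(x, y) from the reward model
    logp_policy: log pi_theta(y_t | x, y_<t) for each generated token, shape (T,)
    logp_ref:    log pi_ref(y_t | x, y_<t) for the same tokens, shape (T,)
    """
    logp_policy = np.asarray(logp_policy, dtype=np.float64)
    logp_ref = np.asarray(logp_ref, dtype=np.float64)
    # Single-sample estimate of the KL contribution at each token position.
    kl_per_token = logp_policy - logp_ref
    rewards = -beta * kl_per_token
    # The reward model scores the whole completion, so its scalar lands on the last token.
    rewards[-1] += rm_score
    return rewards

r = shaped_rewards(1.0, [-1.0, -0.5, -2.0], [-1.5, -0.5, -1.0], beta=0.1)
```

Summing the returned vector recovers the sequence-level quantity: the reward model score minus \(\beta\) times the total log-ratio between policy and reference.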

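Anthropic's 2022 training analysis found a roughly linear relationship between RL reward and \(\sqrt{D_\text{KL}}\). Reward therefore grows more slowly than the KL divergence itself: each additional unit of drift from the reference buys less reward than the last, which suggests a practical KL budget beyond which further divergence is rarely worth its cost.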
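To make the diminishing returns concrete, suppose, purely for illustration, that reward follows \(r \approx a + b\sqrt{D_\text{KL}}\), where \(a\) and \(b\) are hypothetical fitted constants, not values from the source. Then the marginal reward per unit of KL is:

\[\frac{\mathrm{d}r}{\mathrm{d}D_\text{KL}} = \frac{b}{2\sqrt{D_\text{KL}}}\]

Under this assumed form, quadrupling the KL budget only doubles the reward gained over the baseline \(a\), and halves the return on each further unit of divergence.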

Why PPO and not a simpler policy gradient

Vanilla REINFORCE updates the policy with:

\[\nabla_\theta J = \mathbb{E}\!\left[R(x,y)\,\nabla_\theta \log \pi_\theta(y|x)\right]\]

Two problems make this impractical for LLMs. First, the estimator is extremely high-variance on long sequences where the reward is sparse (a single scalar at the end of hundreds of tokens). Second, a single large gradient step can collapse the policy irreversibly, because nothing in the update bounds how far the policy's output distribution moves away from the one that generated the samples.
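The variance problem shows up even in a toy setting. The sketch below applies the REINFORCE estimator to a softmax policy over a handful of discrete actions, a deliberately simplified stand-in for a language model. Every name in it is hypothetical. It also shows the effect of subtracting a constant baseline from the reward, a standard variance-reduction device introduced here for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_grad(logits, rewards, n_samples, rng, baseline=0.0):
    """Monte Carlo REINFORCE estimate of the gradient of E[R] w.r.t. the logits.

    Returns the mean gradient and the total per-sample variance.
    """
    probs = softmax(logits)
    actions = rng.choice(len(probs), size=n_samples, p=probs)
    # For a softmax policy, grad of log pi(a) w.r.t. the logits is onehot(a) - probs.
    score = np.eye(len(probs))[actions] - probs
    weights = rewards[actions] - baseline
    per_sample = weights[:, None] * score
    return per_sample.mean(axis=0), per_sample.var(axis=0).sum()

def exact_grad(logits, rewards):
    """Closed-form gradient of E[R] for the same softmax policy."""
    probs = softmax(logits)
    return probs * (rewards - probs @ rewards)

rng = np.random.default_rng(0)
logits = np.zeros(4)
rewards = np.array([10.0, 11.0, 12.0, 13.0])
grad_raw, var_raw = reinforce_grad(logits, rewards, 200_000, rng)
grad_base, var_base = reinforce_grad(logits, rewards, 200_000, rng, baseline=rewards.mean())
```

Both estimates converge to the same gradient, but the per-sample variance is far lower once the baseline is subtracted. The same intuition sits behind replacing raw rewards with advantage estimates from a critic.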
