PPO for Language Models

GPT-3 could write fluent prose about almost anything, yet early user studies found it frequently produced responses that were unhelpful, dishonest, or subtly toxic. The gap between "predicts plausible tokens" and "behaves as intended" is not closed by more pretraining data. It is closed by optimisation against a signal of human preference, and the algorithm that made this practical at scale is Proximal Policy Optimisation (PPO).

What PPO actually optimises

Standard policy gradient methods estimate the gradient of expected reward with respect to the policy parameters θ. The estimator is:

∇J(θ) = E_t [ ∇ log π_θ(a_t | s_t) · A_t ]

where A_t is an advantage estimate (how much better action a_t was than the policy's average behaviour in state s_t). The problem is that large gradient steps can collapse the policy: one bad update and the distribution shifts so far that all subsequent rollouts are off-distribution, producing a death spiral.
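To make the estimator concrete, here is a minimal NumPy sketch for a toy softmax policy whose parameters are the logits themselves. The function names, the three-action setup, and the sample actions and advantages are all hypothetical and chosen only for illustration; a real language model would obtain the same gradient through automatic differentiation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_estimate(logits, actions, advantages):
    """Monte Carlo estimate of E[ grad log pi(a) * A ] with respect to the logits.

    Assumes a state-independent categorical policy pi = softmax(logits),
    for which grad_logits log pi(a) = onehot(a) - pi.
    """
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    for a, adv in zip(actions, advantages):
        score = -probs.copy()
        score[a] += 1.0          # onehot(a) - probs
        grad += score * adv      # weight the score function by the advantage
    return grad / len(actions)

# Illustrative data: action 0 was good twice, action 1 was bad once.
logits = np.zeros(3)
actions = np.array([0, 1, 2, 0])
advantages = np.array([1.0, -1.0, 0.0, 1.0])

grad = policy_gradient_estimate(logits, actions, advantages)
# Positive entry for action 0, negative for action 1; entries sum to zero.
```

Ascending this gradient raises the logit of the action with positive advantage and lowers the logit of the action with negative advantage. Nothing in the estimator limits how far a single step can move the policy, which is the weakness the rest of the section addresses.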

TRPO (Schulman et al., 2015) handled this with a hard KL constraint:

maximise  E_t [ π_θ(a_t | s_t) / π_θ_old(a_t | s_t) · A_t ]
subject to  KL(π_θ_old || π_θ) ≤ δ

PPO (Schulman et al., 2017) gets a similar protective effect more cheaply by clipping the probability ratio r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) directly in the objective:

L^CLIP(θ) = E_t [ min( r_t(θ) · A_t,  clip(r_t(θ), 1-ε, 1+ε) · A_t ) ]

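When A_t > 0 (the action was good), the min caps the objective once r_t exceeds 1+ε, so there is no incentive to push the action's probability up any further. When A_t < 0 (the action was bad), the objective goes flat once r_t falls below 1-ε, so there is no incentive to push the probability down any further. In both cases the clipped term contributes zero gradient. A ratio that has moved in the wrong direction is left unclipped, which makes L^CLIP a pessimistic bound on the unclipped surrogate. A typical value is ε = 0.2. This single change removes the need for a second-order constrained optimisation at every step and leaves a first-order objective that standard optimisers can handle, which is what makes the algorithm practical for billion-parameter models.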
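The clipped objective is short enough to write out directly. The sketch below is a NumPy illustration, not a training loop: the function name `ppo_clip_terms` and the four sample ratios are assumptions made for the example, and log-probabilities are taken as inputs because that is what a language model typically exposes.

```python
import numpy as np

def ppo_clip_terms(logp_new, logp_old, advantages, eps=0.2):
    """Per-sample terms of L^CLIP; their mean is the objective to maximise."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    ratio = np.exp(logp_new - logp_old)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped)                # pessimistic choice

# Four illustrative cases: ratio above / below the clip range,
# each paired with a positive and a negative advantage.
ratios = np.array([1.5, 0.5, 0.5, 1.5])
advantages = np.array([1.0, -1.0, 1.0, -1.0])

terms = ppo_clip_terms(np.log(ratios), np.zeros(4), advantages)
# terms is approximately [1.2, -0.8, 0.5, -1.5]
objective = terms.mean()
```

The first two cases are clipped (1.5 is capped at 1.2 for a good action, 0.5 is floored at 0.8 for a bad one). The last two are not: the ratio moved in the direction that hurts the objective, and the min keeps the full penalty.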

How the language modelling setup maps onto the RL abstraction

RL concept	Language model equivalent
State `s_t`	Prompt tokens + tokens generated so far
Action `a_t`	Next token sampled from the policy
Episode	One full response (prompt to EOS)
Policy `π_θ`	The LLM being fine-tuned
Reward `R`	Scalar from a reward model trained on human comparisons
Reference policy `π_ref`	Frozen copy of the supervised fine-tuned (SFT) model

Because a single token is an "action" and episodes are short (typically under 1024 tokens), the RL horizon is compact enough for on-policy PPO rollouts to be tractable. The SFT model is the starting point; PPO nudges it toward higher reward without letting it drift too far from sensible language.
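The mapping in the table can be sketched as a rollout loop. Everything here is a toy stand-in: the four-token vocabulary, `toy_policy_probs`, and the rule that makes EOS more likely as the sequence grows are hypothetical, used only to show where state, action, and episode appear in code.

```python
import numpy as np

VOCAB = ["<eos>", "yes", "no", "maybe"]   # hypothetical toy vocabulary
EOS_ID = 0

def toy_policy_probs(state):
    """Stand-in for pi_theta(. | s_t): EOS becomes likelier as the state grows."""
    logits = np.array([0.5 * len(state) - 2.0, 1.0, 1.0, 1.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def rollout(prompt_ids, max_new_tokens=8, seed=0):
    """Generate one episode and record (state, action, log-prob) per step."""
    rng = np.random.default_rng(seed)
    state = list(prompt_ids)               # s_0 = prompt tokens
    trajectory = []
    for _ in range(max_new_tokens):
        probs = toy_policy_probs(state)
        action = int(rng.choice(len(VOCAB), p=probs))   # a_t = next token
        trajectory.append((tuple(state), action, float(np.log(probs[action]))))
        state.append(action)               # s_{t+1} = s_t + [a_t]
        if action == EOS_ID:               # the episode ends at EOS
            break
    return trajectory

trajectory = rollout(prompt_ids=[1, 2])
for state, action, logp in trajectory:
    print(len(state), VOCAB[action], round(logp, 3))
```

The stored log-probabilities play the role of π_θ_old when the probability ratio is computed later, which is why rollouts record them at sampling time.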

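The KL-penalised reward used in InstructGPT (Ouyang et al., 2022) is, for a sampled response y:

R_total(x, y) = r_φ(x, y) - β · log[ π_θ(y|x) / π_ref(y|x) ]

Here r_φ is the reward model's score and β sets how strongly the policy is held near the reference. Averaged over y ~ π_θ(·|x), the penalty term equals β · KL[ π_θ(·|x) || π_ref(·|x) ]. InstructGPT's PPO-ptx variant also adds a pretraining log-likelihood term, omitted here.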
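One common way to implement this reward is to spread the KL penalty across tokens and add the reward model score at the final token. The sketch below assumes that pattern. It is not taken from the InstructGPT code, and the function name, β = 0.1, and the log-probability values are illustrative assumptions.

```python
import numpy as np

def kl_penalised_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards for one response.

    Each token is charged beta * (log pi_theta - log pi_ref); the scalar
    reward model score is added at the last token of the response.
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)

    log_ratio = logp_policy - logp_ref      # per-token log-ratio (single-sample KL term)
    rewards = -beta * log_ratio
    rewards[-1] += rm_score                 # sequence-level score lands on the final token
    return rewards

# Illustrative per-token log-probs for a three-token response.
logp_policy = [-1.0, -0.5, -2.0]
logp_ref = [-1.2, -0.5, -1.5]

rewards = kl_penalised_rewards(rm_score=1.0, logp_policy=logp_policy, logp_ref=logp_ref)
total = rewards.sum()   # equals rm_score - beta * sum of the per-token log-ratios
```

Because the sequence log-probability is the sum of token log-probabilities, the per-token penalties add up to the sequence-level log-ratio, so the summed reward matches the response-level formula.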
