The KL-Regularised RL Objective

A 1.3 billion-parameter InstructGPT model is preferred by human raters over the raw 175 billion-parameter GPT-3, a model more than 100 times its size, despite seeing no additional capability training. The difference is not scale; it is the objective the model was optimised against. That objective contains a term most practitioners gloss over: the KL divergence penalty. Understanding why it is there, what it does at every training step, and precisely where it breaks down separates practitioners who can tune RLHF pipelines from those who cargo-cult hyperparameters.

The Bare Reward Problem

Reinforcement learning requires a scalar reward signal. In language model post-training the reward typically comes from a learned reward model (RM) trained on human preference comparisons. Naively, the RL objective is:

maximise E_{x~D, y~pi_theta} [ r_phi(x, y) ]

where x is a prompt drawn from distribution D, y is the response sampled from the current policy pi_theta, and r_phi is the reward model parameterised by phi.

This objective is straightforward but dangerous. The reward model is an imperfect proxy for human preference. It was fitted on a finite dataset of comparison pairs, so its scores are only trustworthy near the data it was trained on. The policy has every incentive to find responses that score high on the proxy while drifting far from the distribution on which the RM was trained. This is Goodhart's Law in closed-loop form: once a measure becomes a target, it ceases to be a good measure. Concretely, a policy trained purely against reward rapidly learns to produce degenerate text - repetitive phrases, garbled tokens, confident nonsense - that happens to exploit blind spots in the reward model.

Adding the KL Term

The standard fix, used in the RLHF work on summarisation (Stiennon et al., 2020) and solidified in InstructGPT (Ouyang et al., 2022), is to regularise the objective with a KL divergence between the current policy and a frozen reference policy:

maximise E_{x~D, y~pi_theta} [ r_phi(x, y) - beta * KL(pi_theta(y|x) || pi_ref(y|x)) ]

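Written out token by token for a single response y sampled from pi_theta, the KL term is estimated as:

KL(pi_theta || pi_ref) ≈ sum_t log( pi_theta(y_t | x, y_{<t}) / pi_ref(y_t | x, y_{<t}) )

The exact KL is the expectation of this sum over y ~ pi_theta, which the outer expectation in the objective already supplies.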

This sum accumulates over every generated token. A single response that diverges only slightly at each step still accrues a meaningful penalty by the end of a long generation - by design.
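As a minimal sketch of that accumulation, assume the per-token log-probabilities of one sampled response are already available from both the policy and the reference model. The function names and the numbers below are illustrative only and do not come from any particular library or paper.

```python
import math


def sequence_kl_estimate(policy_logprobs, ref_logprobs):
    """Single-sample estimate of KL(pi_theta || pi_ref) for one response.

    Each list holds log pi(y_t | x, y_<t) for the tokens that were actually
    sampled from the current policy, in generation order.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Sum of per-token log ratios: log pi_theta(y_t|...) - log pi_ref(y_t|...)
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def penalised_reward(rm_score, policy_logprobs, ref_logprobs, beta):
    """Reward model score minus beta times the KL estimate."""
    return rm_score - beta * sequence_kl_estimate(policy_logprobs, ref_logprobs)


# Hypothetical three-token response: the policy agrees with the reference on
# the middle token and is more confident than the reference on the other two.
policy_lp = [math.log(0.50), math.log(0.40), math.log(0.60)]
ref_lp = [math.log(0.40), math.log(0.40), math.log(0.30)]

kl_hat = sequence_kl_estimate(policy_lp, ref_lp)
shaped = penalised_reward(1.0, policy_lp, ref_lp, beta=0.1)
print(f"KL estimate: {kl_hat:.4f}, penalised reward: {shaped:.4f}")
```

Here the estimate equals log(1.25) + 0 + log(2) = log(2.5), roughly 0.92 nats from only three tokens. For an individual sample the log ratio can be negative; it is only the expectation over samples from the policy that is guaranteed to be non-negative.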

The reference policy pi_ref is almost always the supervised fine-tuned (SFT) checkpoint that precedes RL training. It is frozen throughout RL and acts as an anchor. The scalar beta controls the trade-off:

beta value	Effect
0	Pure reward maximisation; RM exploitation inevitable
very small (0.01-0.05)	Light regularisation; policy can drift; reward may spike then collapse
moderate (0.1-0.3)	Typical operating range in most published RLHF systems
large (>1)	Policy barely moves; essentially SFT behaviour preserved
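
The trade-off can be seen exactly in a toy setting. For a fixed prompt with a finite set of candidate responses, the KL-regularised objective has a well-known closed-form maximiser, pi*(y|x) proportional to pi_ref(y|x) * exp(r_phi(x, y) / beta). The sketch below uses made-up reference probabilities and reward scores (assumptions for illustration, not measurements) to show how far that optimum moves from pi_ref as beta changes. Real RLHF training does not compute this optimum directly.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """Maximiser of E[r] - beta * KL(pi || pi_ref) over a finite response set."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Work in log space and subtract the max logit for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """Exact KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical prompt with three candidate responses. The last one is rare
# under the reference policy but scores highest, as an RM exploit might.
ref_probs = [0.70, 0.25, 0.05]
rewards = [0.0, 0.1, 0.4]

results = {}
for beta in (0.02, 0.2, 2.0):
    pi = optimal_policy(ref_probs, rewards, beta)
    results[beta] = (pi, kl_divergence(pi, ref_probs))
    rounded = [round(p, 3) for p in pi]
    print(f"beta={beta}: policy={rounded}, KL={results[beta][1]:.3f}")
```

With these numbers the smallest beta puts almost all probability on the rare high-reward response, while the largest leaves the policy close to the reference. The absolute values depend on the scale of the reward scores, so a beta that is "moderate" for one reward model may be too weak or too strong for another.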
