The KL Penalty and Reference Model

When OpenAI published InstructGPT in 2022, human labelers preferred outputs from the 1.3 billion parameter RLHF-tuned model over those of the 175 billion parameter GPT-3, even though it has over 100x fewer parameters. A result like that suggests the fine-tuning procedure was doing something precise and disciplined. The discipline came, in large part, from a single term in the objective: a penalty proportional to the KL divergence between the fine-tuned policy and a frozen copy of the model it started from.

The reward hacking problem that makes this necessary

Reinforcement learning on a scalar reward is brutally literal. If your reward model gives high scores to responses that sound confident, the policy will learn to sound confident regardless of accuracy. If it gives high scores to verbose answers, the policy will become pathologically long-winded. This phenomenon is called reward hacking, and it is an instance of Goodhart's Law applied to learned reward models: once a measure becomes a target, it ceases to be a good measure.

The problem compounds because the reward model itself is imperfect. It was trained on a finite set of human comparisons; it extrapolates beyond that distribution in ways no one fully controls. A policy that maximises the reward model's score without constraint will systematically find and exploit those extrapolation errors, often producing fluent-sounding but nonsensical or even harmful text.

The fix is to add a regularisation term that penalises the policy for drifting too far from a reference distribution - the model before RL fine-tuning. The modified objective is:

r_total(x, y) = r_reward_model(x, y) - β · KL[π_θ(y|x) || π_ref(y|x)]

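Here π_θ is the policy being optimised, π_ref is the frozen reference model, x is the prompt, y is the completion, and β controls the trade-off. The KL term is an expectation over completions drawn from π_θ; in practice each sampled completion is charged its own log-ratio, log π_θ(y|x) - log π_ref(y|x), and the average of those charges recovers the KL. A large β keeps the policy close to the reference; a small β lets it drift further in pursuit of reward.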
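To make the trade-off concrete, here is a minimal sketch of the penalised reward for one sampled completion. It assumes the sequence log-probabilities under both models are already available; the function name `penalised_reward` and all numbers are hypothetical, chosen only for illustration.

```python
def penalised_reward(reward, logp_policy, logp_ref, beta):
    """Reward-model score minus beta times the sampled log-ratio.

    logp_policy and logp_ref are the total log-probabilities of the same
    completion under the live policy and the frozen reference.
    """
    log_ratio = logp_policy - logp_ref  # this completion's contribution to the KL term
    return reward - beta * log_ratio


# Two hypothetical completions for the same prompt (numbers are made up).
# "drifted" scores well on the reward model but is far likelier under the
# policy than under the reference; "close" scores lower but has barely moved.
drifted = {"reward": 2.0, "logp_policy": -10.0, "logp_ref": -40.0}
close = {"reward": 1.5, "logp_policy": -20.0, "logp_ref": -21.0}

for beta in (0.0, 0.02, 0.1):
    r_drifted = penalised_reward(beta=beta, **drifted)
    r_close = penalised_reward(beta=beta, **close)
    print(f"beta={beta}: drifted={r_drifted:.2f}, close={r_close:.2f}")
```

With these made-up numbers the drifted completion wins at β = 0, but the ordering flips once β is large enough for the 30-nat log-ratio to outweigh its 0.5 reward advantage.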

What the reference model actually is

The reference model is a frozen snapshot of the supervised fine-tuned (SFT) checkpoint, the policy just before RL begins. It is not the raw pretrained base; it is the already-instruction-tuned model that has learned to follow the general format and style expected by human raters.

This matters because the SFT checkpoint already represents a large investment: it encodes language coherence, factual grounding, and formatting conventions absorbed during pretraining and sharpened during SFT. The KL penalty protects this investment. Without it, RL can destroy these properties within relatively few updates. The failure is informally called "policy collapse" when outputs degenerate, and it contributes to the "alignment tax" when it shows up as degraded performance on standard benchmarks.

Concretely, the reference model sits in memory alongside the live policy during training. At each step, the same prompt-completion pair is scored by both models to compute the per-token log-probability ratio that forms the KL term:

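KL[π_θ || π_ref] ≈ Σ_t [ log π_θ(y_t | x, y_<t) - log π_ref(y_t | x, y_<t) ]

This is a single-sample estimate, accumulated token by token over a completion sampled from the policy. Because the expectation is taken under π_θ rather than π_ref, this direction is usually called the reverse KL; it is mode-seeking with respect to the reference. Both directions appear in the literature, but KL[π_θ || π_ref] is the standard choice in PPO-based RLHF.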
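The per-token sum can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not a training-ready implementation: logits are passed in as nested lists, the helper names `log_softmax` and `sampled_kl` are hypothetical, and a real pipeline would use batched tensor operations instead.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sampled_kl(policy_logits, ref_logits, token_ids):
    """Sum over positions of log pi_theta(y_t) - log pi_ref(y_t).

    policy_logits, ref_logits: one list of vocabulary logits per position.
    token_ids: the tokens actually sampled from the policy at each position.
    Returns (per-token log-ratios, their sum).
    """
    per_token = []
    for p_logits, r_logits, tok in zip(policy_logits, ref_logits, token_ids):
        lp = log_softmax(p_logits)[tok]
        lr = log_softmax(r_logits)[tok]
        per_token.append(lp - lr)
    return per_token, sum(per_token)


# Toy 3-token vocabulary, 2-position completion (all numbers hypothetical).
policy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
ref_logits = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
token_ids = [0, 1]

per_token, total = sampled_kl(policy_logits, ref_logits, token_ids)
print(per_token, total)
```

In this toy case the second position contributes zero because the two models agree there. Individual per-token terms can be negative for a given sample even though the true KL is never negative; only the expectation is guaranteed to be non-negative.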

Choosing β: the alignment-capability trade-off

β is one of the most consequential hyperparameters in RLHF. Sweeping it traces out a Pareto frontier between reward maximisation and policy proximity, and any single value picks one point on that frontier. Several practical observations from published work:

β range (typical)	Behaviour
0.0 - 0.01	Near-unconstrained; reward hacking likely within a few thousand steps
0.02 - 0.1	Standard InstructGPT / learning-to-summarise range
0.2 - 0.5	Conservative; policy moves slowly; useful when reward model is weak
> 1.0	Policy barely moves; effectively SFT with a nudge

OpenAI's summarisation paper (Stiennon et al., 2020) reported that β values around 0.02-0.05 worked well for their task, with clear degradation at both extremes. The right value depends on the task and on the quality of the reward model. Practitioners often schedule β over training, starting conservatively and relaxing it as the reward model proves trustworthy.
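An alternative to a hand-written schedule is to adjust β automatically so that the measured KL stays near a chosen target. The sketch below is a simple proportional controller of that kind; it is not taken from the papers cited here, and the function name `update_beta` and the `gain`, `max_error`, and `target_kl` values are assumptions picked for illustration.

```python
def update_beta(beta, observed_kl, target_kl, gain=0.1, max_error=0.2):
    """Nudge beta towards whatever value holds the KL near target_kl.

    The relative error is clipped so one noisy batch cannot swing beta much.
    """
    error = (observed_kl - target_kl) / target_kl
    error = max(-max_error, min(max_error, error))
    return beta * (1.0 + gain * error)


beta = 0.05
for observed_kl in (12.0, 9.0, 6.0, 3.0):  # hypothetical per-batch KL, in nats
    beta = update_beta(beta, observed_kl, target_kl=6.0)
    print(f"observed KL={observed_kl:.1f} -> beta={beta:.5f}")
```

When the policy drifts further than the target, β rises and the penalty tightens; when it stays closer than the target, β falls and the policy is given more room.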

The KL penalty's algebraic life in DPO

Direct Preference Optimisation (DPO, Rafailov et al., 2023) makes the connection between KL regularisation and the optimal policy explicit. The optimal solution to the KL-penalised reward maximisation problem has a closed form:

π*(y|x) ∝ π_ref(y|x) · exp( r(x, y) / β )

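This says the ideal policy reweights the reference distribution by an exponentiated reward. Every sequence gets upweighted or downweighted according to how much the reward model likes it, with β controlling how sharply that reweighting happens: a small β concentrates probability on the highest-reward sequences, while a large β leaves the policy close to the reference.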
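The closed form is easy to evaluate on a toy problem. The sketch below assumes a prompt with only three possible completions, so the normalising constant can be computed exactly; the function name `optimal_policy`, the probabilities, and the rewards are all hypothetical.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalised."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalisation.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


ref_probs = [0.7, 0.2, 0.1]   # hypothetical reference probabilities
rewards = [0.0, 1.0, 2.0]     # hypothetical reward-model scores

for beta in (10.0, 1.0, 0.1):
    probs = optimal_policy(ref_probs, rewards, beta)
    print(f"beta={beta}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At large β the output stays near the reference probabilities; at small β almost all mass moves to the highest-reward completion, even though the reference gave it only 10%.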

DPO rearranges this to eliminate the reward model entirely: the log-probability ratio between the policy and the reference implicitly encodes the reward. This is why DPO still requires a reference model, even though it discards the explicit reward function. The reference is not an optional regulariser; it is structurally load-bearing.
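Solving the closed form for the reward gives r(x, y) = β · log[π*(y|x) / π_ref(y|x)] + β · log Z(x), where Z(x) is the normalising constant. In a pairwise comparison on the same prompt, Z(x) cancels, leaving only differences of log-ratios. A minimal sketch of the resulting per-pair loss follows, assuming sequence log-probabilities are precomputed; the function and argument names are hypothetical.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Per-pair DPO loss: -log sigmoid(beta * (margin_w - margin_l)).

    logp_w, logp_l: policy log-probs of the preferred and dispreferred completions.
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference.
    """
    margin_w = logp_w - ref_logp_w
    margin_l = logp_l - ref_logp_l
    z = beta * (margin_w - margin_l)
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed stably for either sign of z
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: both margins are zero, so loss = log 2.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0, beta=0.1))

# Policy has raised the preferred completion relative to the reference.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0, beta=0.1))
```

Every term in the loss is a difference against the reference log-probabilities, which is the sense in which the reference is load-bearing: remove it and the implicit reward is no longer defined.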

When it falls down

Reference model quality sets a ceiling. If the SFT checkpoint is poorly trained (on low-quality demonstrations, insufficient data, or misaligned instructions), the KL penalty anchors the RL policy to those flaws. The same constraint that stops the policy from collapsing into gibberish also limits how far it can move toward genuinely better behaviour.

β is brittle across domains. A β calibrated for summarisation does not transfer cleanly to code generation or instruction following. The optimal value varies with the reward model's noise level, the task distribution, and the length of responses. Practitioners report needing fresh sweeps per task.

KL estimates are noisy on long completions. The per-token sum is a single-sample estimate of the KL divergence, and its variance grows as completions get longer (hundreds of tokens). This adds noise to the gradient signal, which can slow training or cause instability when combined with a low-quality reward signal.
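A rough feel for how this noise scales with length can be had from a deliberately simplified model. The sketch below assumes every token is drawn independently from the same small distribution, which real language models do not satisfy; the distributions and the helper name `log_ratio_moments` are hypothetical. Under that assumption, the mean and the variance of the summed log-ratio both grow linearly with length.

```python
import math


def log_ratio_moments(p, q):
    """Exact mean and variance of log(p(y)/q(y)) when y is drawn from p."""
    mean = sum(a * math.log(a / b) for a, b in zip(p, q))
    second = sum(a * math.log(a / b) ** 2 for a, b in zip(p, q))
    return mean, second - mean ** 2


p = [0.6, 0.3, 0.1]  # hypothetical per-token policy distribution
q = [0.4, 0.4, 0.2]  # hypothetical per-token reference distribution
kl, var = log_ratio_moments(p, q)

for length in (10, 100, 1000):
    # For independent tokens, means and variances both add across positions.
    print(f"T={length}: expected KL={length * kl:.2f}, "
          f"std of one-sample estimate={math.sqrt(length * var):.2f}")
```

In this toy setting the standard deviation of a single completion's estimate grows with the square root of the length, so longer completions inject more absolute noise into each penalised reward.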

Catastrophic forgetting is only partially mitigated. The KL penalty is evaluated only on prompts drawn during RL training, so it limits drift on that distribution. If RL fine-tuning is on a narrow distribution (e.g., customer-service conversations only), the policy can still degrade on out-of-distribution tasks that the KL term never directly measures.

The reference can be gamed adversarially. If an adversarial prompt shifts the reference model's own distribution sharply (e.g., via a long jailbreak prefix), the KL computed relative to that shifted reference may be small even for harmful completions. The penalty measures divergence from the reference's output on the specific prompt, not from any global notion of safe text.
