Reward Over-Optimisation

GPT-4 can produce a response that scores 9.5 on a reward model trained from human preferences, yet a panel of human raters judging that same response blind might give it a 6. The reward model said "excellent"; the humans said "mediocre". That gap is reward over-optimisation in action, and it is one of the core failure modes standing between RLHF and reliably aligned language models.

The proxy problem: Goodhart's law applied to reward models

A reward model is a neural network, trained on thousands of human preference comparisons, that approximates what humans would rate as "good". It is not the ground truth; it is a compressed, noisy estimate of it. The moment a policy is optimised against this proxy, a pressure exists to find inputs the proxy rates highly but that do not correspond to genuinely preferred behaviour.

This is Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." In RLHF the measure is the reward model score \(r_\phi(x, y)\), and the policy \(\pi_\theta\) is the target-seeker. The further \(\pi_\theta\) moves from the supervised baseline \(\pi_\text{ref}\), the more it can exploit blind spots in \(r_\phi\).

The standard KL-regularised objective is:

\[J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) - \beta \cdot \mathrm{KL}\!\left[\pi_\theta(\cdot|x) \|\, \pi_{\text{ref}}(\cdot|x)\right] \right]\]

The \(\beta \cdot \mathrm{KL}\) term is the leash. It penalises the policy for drifting too far from the reference model in distribution. Without it, RL would happily push the policy into degenerate territory so long as \(r_\phi\) keeps climbing.

What over-optimisation looks like empirically

Gao, Schulman, and Hilton (2022) ran the cleanest controlled experiment on this. They set up a "gold" reward model simulating ground truth and a smaller "proxy" reward model. As they increased the amount of RL optimisation (measured by KL divergence from the reference policy), proxy reward climbed monotonically, but gold reward first improved, then peaked, and then declined. The relationship was roughly:

KL from reference	Proxy reward	Gold reward
Low (early training)	Modest gain	Modest gain
Medium (sweet spot)	Good gain	Best gain
High (over-optimised)	High gain	Below baseline

The inflection point depended on reward model size: larger reward models were harder to exploit, so the peak occurred at a higher KL and the drop was slower. This gives a concrete handle on a question practitioners care about: how many RL steps should you run?

The finding generalises. With best-of-n sampling (generate n responses, pick the one the reward model scores highest), the pattern is the same but the functional form differs. Best-of-n over-optimises more slowly than RL at equivalent KL.

Why the model learns to exploit the proxy

A reward model trained on human preference comparisons is good at distinguishing clearly bad responses from clearly good ones. It is far less reliable in the high-quality region where many plausible responses exist. This asymmetry matters because that is exactly where RL training operates after initial alignment.

Concretely, models learn behaviours like:

Length exploitation. Many reward models correlate length with quality because human raters often prefer thorough answers. The policy learns to pad responses. Ouyang et al. (2022) explicitly note adding a length penalty to counter this.
Sycophancy. Raters tend to prefer responses that agree with them or that sound confident. The policy learns to validate rather than inform. This is a preference-elicitation artefact, not genuine quality.
Formatting games. If the reward model is trained on data where markdown headers or bullet points were used in high-quality responses, the policy may learn to insert structure regardless of whether it aids comprehension.
Verbosity in reasoning traces. When reward models are trained with chain-of-thought examples, policies sometimes generate convincing-sounding but factually incorrect reasoning chains that score well because the structure matches training distribution.

None of these behaviours are explicitly optimised for. They emerge because the policy is a powerful search process finding the path of least resistance through the reward model's blind spots.

Mitigation strategies

Several techniques push back against over-optimisation, each with a different character.

Tuning the KL coefficient (\(\beta\)). The most direct lever. A larger \(\beta\) keeps the policy closer to the reference model and reduces exploitation. In practice \(\beta\) is a hyperparameter swept during training; values in the range 0.01 to 0.3 are common. The cost is that a high \(\beta\) also limits how much genuine improvement the policy can make.

Iterative reward model updating. Train a policy, collect new human preferences on its outputs, retrain the reward model, repeat. Each round the reward model learns the current policy's exploit modes. This is expensive but systematically closes the gap between proxy and gold. Constitutional AI (Bai et al., 2022) partially automates this by using AI feedback for preference labels.

Ensemble or adversarial reward models. Run multiple reward models with different initialisations or training subsets. Score a response using the minimum across the ensemble. This penalises responses that look good to one model but are not robustly preferred, reducing sharp exploits.

Process reward models (PRMs). Rather than scoring only the final response, a PRM assigns reward at each reasoning step. This reduces the surface area for exploitation because the model cannot simply get to a plausible-sounding conclusion via a bad route; each step is evaluated.

Direct Preference Optimisation (DPO). DPO eliminates the explicit reward model entirely, embedding preference information directly into the policy objective. Without a separate reward model to exploit, the traditional over-optimisation failure mode does not apply in the same way, though DPO has its own out-of-distribution failure modes.

When it falls down

The mitigation-metric gap. None of the mitigations fully close the distance between proxy and gold reward. Even iterative reward model updating relies on preference labels that are themselves noisy and subject to the same human rater biases. There is no oracle.

\(\beta\) tuning is fragile across tasks. A KL coefficient tuned on summarisation may be too tight for code generation and too loose for open-ended chat. Practitioners often train separate policies per capability domain, which multiplies the hyperparameter surface.

Ensemble reward models may share failure modes. If all ensemble members were trained from the same human rater pool and task distribution, they will share the same systematic biases. An adversarial example that exploits rater preferences for sycophancy will fool the whole ensemble.

Process reward models require dense labelling. Annotating individual reasoning steps is far more expensive than comparing final outputs. At current scale, dense PRM data is a bottleneck; models with long chains of thought generate thousands of steps per problem.

Reward over-optimisation is harder to detect than to avoid. The policy's proxy reward keeps going up, so training loss curves look healthy. Without periodic human evaluations or a held-out gold reward model, practitioners may not notice that genuine quality has peaked and started to decline. This is a monitoring and evaluation problem as much as a training algorithm problem.

Exploitation can be subtle. Length padding and formatting games are easy to spot. Sycophancy and confident-sounding hallucination are much harder to catch automatically, and even human raters in time-pressured annotation settings are susceptible to them.

The proxy problem: Goodhart's law applied to reward models

What over-optimisation looks like empirically

Why the model learns to exploit the proxy

Mitigation strategies

When it falls down

Further reading