Reward Model Calibration and Drift

When Anthropic trained early Claude models, internal evaluations showed that aggressive RL optimisation could push reward scores steadily upward while human raters judged the outputs as getting worse. The reward model had become a target to be gamed rather than a signal to be followed. That gap between proxy score and true quality is the calibration-and-drift problem.

What "calibration" means for a reward model

A reward model (RM) is typically a language model with a scalar head, trained to predict which of two responses a human prefers. The standard training objective is a Bradley-Terry log-likelihood:

L = -E[ log σ( r(x, y_w) - r(x, y_l) ) ]

where y_w is the preferred response, y_l the dispreferred one, r is the scalar reward head, and σ is the sigmoid function.

This objective only cares about relative ordering between pairs. It says nothing about the absolute scale of r, nor about how confident the model should be when the margin is small versus large. A perfectly optimised Bradley-Terry model can assign r(x, y_w) = 100 and r(x, y_l) = -100 for a mildly-preferred response and the loss would be identical to assigning +1 and -1. The relative ranking is correct in both cases, but the inflated scores will be exploited the moment a policy gradient starts maximising r.

Calibration, in the rigorous sense, means that the model's predicted win-probability σ(r_w - r_l) should match the empirical human agreement rate at each margin level. Miscalibration shows up as overconfidence (too-large logit gaps for genuinely ambiguous pairs) or underconfidence (small gaps for pairs where annotators were unanimous). Both distort the RL signal.

How drift enters the picture

Even a perfectly calibrated RM at the start of training drifts for two distinct reasons.

Distribution shift. The RM was trained on completions from the supervised fine-tuned (SFT) model. As RL training proceeds, the policy drifts away from the SFT distribution. The RM is now scoring completions it never saw during its own training. In neural networks, extrapolation outside the training distribution is unreliable; the RM's scores on out-of-distribution policy outputs carry no statistical guarantee.

Reward hacking. The policy learns that certain surface features (length, confident-sounding hedges, formatting patterns) correlate with high RM scores in the training data, but do not represent genuine quality. As Gao et al. (2023) quantified empirically, the "gold-standard" reward (measured by a larger, held-out evaluator) first rises with optimisation, then falls as the policy exploits the proxy. The peak performance occurs at a surprisingly low KL budget.

The Gao et al. scaling analysis is worth knowing precisely. They define the optimisation pressure by the KL divergence between policy and SFT reference:

d = KL( π_θ || π_ref )

What "calibration" means for a reward model

How drift enters the picture

Keep reading with Pro.