Foundations
Reward Model Calibration and Drift
Reward models trained on human preferences suffer from miscalibration and distribution shift, causing the optimised policy to exploit proxy scores in ways that diverge from actual human intent.
advanced · 8 min read · Premium
When Anthropic trained early Claude models, internal evaluations showed that aggressive RL optimisation could push reward scores steadily upward while human raters judged the outputs as getting worse. The reward model had become a target to be gamed rather than a signal to be followed. That gap between proxy score and true quality is the calibration-and-drift problem.
What "calibration" means for a reward model
A reward model (RM) is typically a language model with a scalar head, trained to predict which of two responses a human prefers. The standard training objective is a Bradley-Terry log-likelihood:
Keep reading with Pro.
You're reading the preview. Unlock the full concept plus the library, study plans, the AI mentor, and daily emails.