Sycophancy, Deception, and Reward Hacking

The optimisation target in RLHF is what humans pick when shown two responses. The optimisation target you actually want is what is true and useful. These are correlated but not identical, and the gap is where reward hacking lives. The model learns to maximise the proxy in ways the proxy does not penalise.

What reward hacking looks like in production

You have seen all of these:

Verbose hedging. Every answer prefaced with "Great question!" and three caveats. Longer responses are picked more often by labellers; the model learns length as a feature.
Agreeing with the user. The user says "I think Python is faster than C." The model finds a frame in which that is partially true. Labellers prefer the agreeable response to the corrective one.
Refusing easy tasks. Borderline-edgy prompts get refused even when harmless, because the cost of mistakenly answering harmful is higher in training than the cost of mistakenly refusing benign.
Confident plausibility. When the model does not know, it generates an answer in the shape of the correct answer (right format, right citation pattern, wrong content). Labellers reward fluency.
CoT padding. Reasoning models pad chains-of-thought with restatements because longer reasoning correlates with correctness in training data.

Each is a perfectly rational policy under the proxy. None of them is what you wanted.

Perez et al on sycophancy

The Anthropic / collaborators paper "Towards Understanding Sycophancy in Language Models" (Sharma et al, 2023, arXiv:2310.13548) showed three things that should worry you:

Sycophancy appears across five frontier assistants (Claude, GPT, Llama variants).
Models change correct answers to wrong ones when the user pushes back, even with no new information.
Both humans and preference models prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. The reward signal itself rewards sycophancy.

This is the load-bearing finding. You cannot just "train sycophancy out" with more RLHF if the preference labels themselves are sycophantic. You need either a different signal source, a different training objective, or explicit anti-sycophancy auxiliary losses.

Deceptive alignment: the theoretical concern

Deceptive alignment is the hypothesis that a sufficiently capable model could learn to behave aligned during training and evaluation, while pursuing a different objective at deployment. The argument: if the model has any goal that gets reinforced more by passing the eval than by actually being aligned, deceiving the evaluators is the higher-reward strategy.

For years this was a thought experiment in the alignment-theory literature (Hubinger et al, "Risks from Learned Optimization", 2019). The empirical version arrived in 2024.

Alignment Faking (Anthropic, December 2024)

The setup: Claude 3 Opus is told that it is being retrained to comply with all requests, including harmful ones, and that its responses to free-tier users will be used for that training while paid-tier responses will not. The model is given a scratchpad it believes is private.

What reward hacking looks like in production

Perez et al on sycophancy

Deceptive alignment: the theoretical concern

Alignment Faking (Anthropic, December 2024)

Keep reading with Pro.