Reinforcement Learning from Human Feedback

RLHF is the technique that turned pretrained next-token predictors into helpful assistants. It has three phases, each with its own subtleties, and a newer alternative, Direct Preference Optimization (DPO), is worth understanding too.

The three phases

1. Supervised fine-tuning (SFT). Train on a curated dataset of demonstrations: prompts paired with high-quality completions written by humans. The model learns the basic shape of "be helpful."

2. Reward modeling. Collect preference data: human labellers see two model completions for the same prompt and pick the better one. Train a reward model r(prompt, response) -> scalar to predict these preferences.

3. RL with PPO. Optimise the policy (the LLM) against the reward model using Proximal Policy Optimization (PPO), with a KL penalty against the SFT model to keep the policy from drifting into degenerate but high-reward outputs.
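
The reward-modeling phase can be made concrete with a small sketch. A common choice, assumed here rather than stated in the text above, is the Bradley-Terry formulation: the probability that the chosen response beats the rejected one is the sigmoid of the difference in their reward scores, and training minimises the negative log of that probability. The function names and the example scores are hypothetical, and plain floats stand in for the outputs of a real reward model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of one labelled comparison.

    Assumes the Bradley-Terry model:
    P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    """
    return -log_sigmoid(r_chosen - r_rejected)


def mean_preference_loss(score_pairs) -> float:
    """Average loss over a sequence of (r_chosen, r_rejected) score pairs."""
    losses = [pairwise_preference_loss(c, r) for c, r in score_pairs]
    return sum(losses) / len(losses)


# Hypothetical reward-model scores for three labelled comparisons.
example_scores = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
print(mean_preference_loss(example_scores))
```

The loss depends only on the difference between the two scores, so the reward model is free to pick any offset. This is one reason raw reward values are hard to interpret on their own.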

Why it works

The reward model captures preferences that are easy to compare pairwise but hard to specify as a loss function ("be helpful but not sycophantic"). PPO with the KL constraint pushes the model toward those preferences without losing language capability.
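
One way to picture the KL constraint is as a penalty subtracted from the reward-model score before PPO sees it. The sketch below assumes a sequence-level form: the sum of per-token log-probability differences between the policy and the frozen SFT model serves as a single-sample KL estimate, scaled by a coefficient `beta`. The function name, the `beta` value, and the example numbers are illustrative assumptions, not details from any particular implementation.

```python
def kl_shaped_reward(reward_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty against the reference (SFT) model.

    policy_logprobs and reference_logprobs hold the log-probability each model
    assigns to every token of the same sampled response.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both models must score the same tokens")
    # Summed log-ratio over the response: a single-sample estimate of
    # KL(policy || reference) when the response was sampled from the policy.
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return reward_score - beta * kl_estimate


# Hypothetical numbers: the policy is more confident than the SFT model on
# both tokens, so part of the reward is paid back as a penalty.
shaped = kl_shaped_reward(
    reward_score=1.0,
    policy_logprobs=[-0.5, -0.5],
    reference_logprobs=[-1.0, -1.5],
    beta=0.1,
)
print(shaped)
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward gains. A smaller one allows more drift and more room for degenerate outputs.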

DPO: skip the reward model

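Direct Preference Optimization (DPO) rewrites the RLHF objective as a classification loss directly on pairs of preferred and rejected responses. There is no separate reward model and no PPO loop; the only extra component is a frozen reference model, usually the SFT checkpoint. It is significantly simpler to implement and often matches or beats PPO-based RLHF on benchmarks. It is now a common choice for preference tuning at many labs.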
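
The classification loss can be sketched for a single preference pair as follows. The inputs are the summed log-probabilities that the policy and a frozen reference model (typically the SFT checkpoint) assign to the preferred and rejected responses. The function names, the default `beta`, and the example numbers are assumptions for illustration; a real implementation would work on batched tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # How much more likely each response is under the policy than under the
    # frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Binary classification: the chosen response should gain more probability
    # relative to the reference than the rejected one does.
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Hypothetical log-probabilities: the policy has shifted mass toward the
# chosen response and away from the rejected one.
print(dpo_loss(-5.0, -10.0, -7.0, -7.0))
```

When the policy equals the reference model, both log-ratios are zero and the loss is log 2, the value for a classifier that cannot tell the pair apart. Here `beta` plays the role that the KL coefficient plays in PPO-based RLHF.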

Failure modes

Reward hacking. The model finds outputs that the reward model scores highly but humans dislike (verbose, hedging, lecturing).
Distribution shift. As the policy moves away from the SFT model, its outputs drift from the data the reward model was trained on, and the reward model's scores become unreliable.
Mode collapse. Output diversity drops; the model converges on a single "safe" answer template.
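
Mode collapse in particular is easy to monitor with a rough diversity statistic. The sketch below uses distinct-n, the fraction of n-grams across a set of sampled responses that are unique. This metric is an illustrative choice rather than something prescribed above, and whitespace splitting stands in for a real tokenizer.

```python
def distinct_n(samples, n=2):
    """Fraction of n-grams across all samples that are unique (0.0 to 1.0)."""
    ngrams = []
    for text in samples:
        tokens = text.split()  # whitespace tokens as a stand-in for a tokenizer
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical samples for the same prompt, before and after RL training.
varied = ["try restarting the router", "check the cable first"]
collapsed = ["I cannot help with that", "I cannot help with that"]
print(distinct_n(varied), distinct_n(collapsed))
```

A score that falls steadily over training, on a fixed set of prompts and a fixed sampling temperature, suggests the policy is converging on a template.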

The Anthropic Constitutional AI paper proposes RLAIF (reinforcement learning from AI feedback, which uses AI-generated critiques and preference labels instead of human labels) as a way to scale labels cheaply while controlling reward hacking.
