RLHF: Reward Modeling, PPO, and the DPO Trade-off

The Setup

RLHF (Reinforcement Learning from Human Feedback) aligns a pretrained language model with human preferences in three stages: supervised fine-tuning (SFT), reward model (RM) training, and policy optimization (typically PPO). The motivation is simple: maximum-likelihood training on internet text produces a model that imitates, not one that prefers helpful, honest outputs. Preferences are easier to collect than gold answers, so we learn from comparisons.

Stage 1: The Reward Model

Annotators see a prompt x with two candidate completions y_w (preferred) and y_l (rejected) sampled from the SFT model. We train a scalar reward function r_θ(x, y), usually the SFT model with its LM head replaced by a linear scalar head, under the Bradley-Terry preference model:

P(y_w > y_l | x) = σ(r_θ(x, y_w) - r_θ(x, y_l))

The loss is the negative log-likelihood of this:

loss = -F.logsigmoid(r_w - r_l).mean()

A few practical notes:

The RM only needs to rank, not produce calibrated values. Adding a constant to r_θ changes nothing.
RM accuracy on held-out pairs typically plateaus at 65-75%, which is roughly how often human annotators agree with each other, so label noise caps what the RM can learn.
The RM is the alignment bottleneck. Reward hacking downstream is mostly RM error being exploited.
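
A minimal sketch of the pairwise loss and of the first note above, written in plain Python on scalar rewards rather than tensors. The function names and the toy reward values are illustrative assumptions, not part of any library.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))

def bradley_terry_loss(r_w, r_l) -> float:
    """Mean negative log-likelihood that each preferred completion wins."""
    return -sum(log_sigmoid(w - l) for w, l in zip(r_w, r_l)) / len(r_w)

def pairwise_accuracy(r_w, r_l) -> float:
    """Fraction of pairs where the preferred completion gets the higher reward."""
    return sum(w > l for w, l in zip(r_w, r_l)) / len(r_w)

# Made-up rewards for three preference pairs.
r_w = [1.2, 0.3, -0.5]
r_l = [0.4, 0.9, -1.5]

loss = bradley_terry_loss(r_w, r_l)
# Shifting every reward by the same constant leaves the loss unchanged,
# because only the difference r_w - r_l enters the likelihood.
shifted_loss = bradley_terry_loss([r + 10.0 for r in r_w], [r + 10.0 for r in r_l])
accuracy = pairwise_accuracy(r_w, r_l)
```

Here `loss` and `shifted_loss` agree up to floating-point error, which is the sense in which the RM only needs to rank.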

Stage 2: PPO Against the Reward Model

We now treat the LM as a policy π_φ(y | x) and maximize expected reward. Maximizing the naive objective E[r_θ(x, y)] on its own collapses to degenerate text that hacks the RM. The fix is a KL penalty against the SFT reference π_ref, applied per token in practice:

R(x, y) = r_θ(x, y) - β * KL(π_φ(·|x) || π_ref(·|x))
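
The KL term is commonly estimated on the sampled tokens rather than computed exactly. A minimal sketch of one such shaping scheme, assuming the RM score is credited at the final token and the per-token penalty is the log-ratio log π_φ - log π_ref of the sampled token. The function name and the numbers are illustrative.

```python
def shaped_token_rewards(rm_score, policy_logprobs, ref_logprobs, beta):
    """Per-token rewards for one completion.

    Each token gets -beta * (log pi_phi - log pi_ref); the sequence-level
    RM score is added to the last token (an assumed convention).
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards

# Made-up log-probs of three sampled tokens under the policy and the reference.
policy_logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.5, -0.5, -1.0]

token_rewards = shaped_token_rewards(
    rm_score=2.0,
    policy_logprobs=policy_logprobs,
    ref_logprobs=ref_logprobs,
    beta=0.1,
)
# Summing the per-token rewards recovers r - beta * (sampled KL estimate).
total_reward = sum(token_rewards)
```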

PPO (Proximal Policy Optimization) maximizes this KL-penalized reward using a clipped surrogate objective that prevents large per-step policy updates:

L_PPO = E[ min( ρ_t * A_t, clip(ρ_t, 1-ε, 1+ε) * A_t ) ]
ρ_t = π_φ(a_t | s_t) / π_φ_old(a_t | s_t)
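
A minimal sketch of the clipped surrogate for a handful of tokens, in plain Python. It assumes log-probs of the sampled tokens under the current and the rollout-time policy are already available; the function name and the default eps are illustrative.

```python
import math

def ppo_clip_objective(new_logprobs, old_logprobs, advantages, eps=0.2):
    """Mean clipped surrogate over tokens (a quantity to maximize)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(lp_new - lp_old)                 # rho_t
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip(rho_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped_gain = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage: the min keeps the unclipped, more
# pessimistic term, so the penalty is not capped.
uncapped_penalty = ppo_clip_objective([math.log(1.5)], [0.0], [-1.0])
```

The asymmetry is the point of the min: the policy gets no extra credit for moving far in a good direction, but is still fully penalized for moving far in a bad one.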

In L_PPO, A_t is the advantage, estimated with GAE (generalized advantage estimation) from a learned value head. During PPO you are therefore holding four models in memory: policy, value (critic), reward model, and frozen reference. Training loop:

Sample completions from the current policy on prompts.
Score them with the RM and subtract the KL-to-reference penalty token by token.
Compute advantages via the value head and GAE.
Take a few PPO epochs over the rollout with the clipped objective.
Repeat.
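
The advantage step of the loop can be sketched as follows for a single rollout. This assumes the per-token rewards have already been shaped with the KL penalty, that the episode ends at the last token (so the bootstrap value is 0), and default values of gamma and lam that are illustrative only.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation for one finished rollout.

    rewards[t] is the shaped reward at token t and values[t] is the
    critic's estimate at token t. The value after the last token is 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Reward arrives only at the last token; the critic predicts 0.5 everywhere.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
advantages = gae_advantages(rewards, values)
```

With lam=1 the estimate reduces to the return minus the value baseline; with lam=0 it is the one-step TD residual.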

The KL coefficient β (sometimes adapted online) is the single most important knob. Too low: the policy drifts off-distribution and reward-hacks. Too high: the policy stays pinned to the SFT model and barely learns.
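
One way to adapt β online is a simple proportional controller that nudges β toward a target KL. The sketch below is an assumption about how such a controller might look; the target, gain, and error clip are illustrative values, not recommendations.

```python
def update_beta(beta, observed_kl, target_kl, gain=0.1, max_error=0.2):
    """Raise beta when measured KL exceeds the target, lower it otherwise."""
    error = observed_kl / target_kl - 1.0
    error = max(-max_error, min(error, max_error))   # limit the step size
    return beta * (1.0 + gain * error)

beta = 0.1
beta_after_high_kl = update_beta(beta, observed_kl=12.0, target_kl=6.0)  # KL too high
beta_after_low_kl = update_beta(beta, observed_kl=3.0, target_kl=6.0)    # KL too low
```

Clipping the error keeps a single noisy KL measurement from swinging β by more than a small percentage per update.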

Why PPO Is Painful

Engineering load: four models, distributed rollouts, careful tokenwise reward shaping.
Instability: reward hacking, mode collapse, KL spikes, value function lag.
Hyperparameter sensitivity: learning rate, β, clip range, rollout size, GAE λ all interact.
Compute: rollouts dominate; you generate at every step.

Enter DPO

Direct Preference Optimization (Rafailov et al., 2023) observes that the optimal policy under the KL-regularized reward objective has a closed-form relationship to the reward:

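r(x, y) = β * log(π*(y|x) / π_ref(y|x)) + β * log Z(x)

Here Z(x) is a normalizing constant that depends on the prompt x but not on the completion y.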
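
The relationship can be checked numerically on a toy problem with three possible completions for a single prompt. The sketch assumes the standard form of the KL-regularized optimum, π*(y|x) proportional to π_ref(y|x) * exp(r(x, y) / β); the probabilities and rewards are made up.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Return (pi_star, Z) with pi_star(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)   # normalizer for this prompt
    return [w / z for w in weights], z

ref_probs = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 2.0]
beta = 0.5

pi_star, z = optimal_policy(ref_probs, rewards, beta)

# r(y) - beta * log(pi*(y) / pi_ref(y)) should be the same for every y.
offsets = [
    r - beta * math.log(p / q)
    for r, p, q in zip(rewards, pi_star, ref_probs)
]
```

Every entry of `offsets` equals β * log of the normalizer, so the additive term is identical across completions and drops out of any reward difference.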

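Substitute this closed form into the Bradley-Terry preference likelihood: the additive term does not depend on y, so it cancels in the difference r(x, y_w) - r(x, y_l), and the reward model vanishes. You optimize the policy directly on preference pairs: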

L_DPO = -log σ( β * [ log π_φ(y_w|x)/π_ref(y_w|x)
                    - log π_φ(y_l|x)/π_ref(y_l|x) ] )

No reward model. No rollouts. No critic. Each pair needs only forward passes through the policy and the frozen reference on y_w and y_l (the reference log-probs can be precomputed once), inside a standard supervised loop.
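
A minimal sketch of the DPO loss for a single pair, in plain Python. It assumes each input is the summed token log-prob of a whole completion under the policy or the reference; the function name, the default β, and the example log-probs are illustrative.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (log-ratio of y_w minus log-ratio of y_l))."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigmoid(m) = softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the policy's log-prob of the preferred completion lowers the loss.
loss_after_update = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

The loss depends only on how far the policy has moved relative to the reference on each completion, not on the absolute log-probs.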

Trade-offs: PPO vs DPO

DPO wins on:
- Simplicity and stability. Looks and trains like SFT.
- Compute. No online sampling, no four-model setup.
- Reproducibility. Fewer knobs that can blow up.

PPO retains advantages:
- On-policy data: the model learns from its own current outputs, which matters when the SFT model is far from desired behavior.
- Reusable RM: one RM can supervise many policies, distillations, or best-of-N sampling at inference.
- Non-preference signals: you can mix in programmatic rewards (unit tests, format checks, safety classifiers).
- Empirical ceiling: at scale and with good RMs, PPO (and related on-policy methods such as RLOO and GRPO) still tends to edge out DPO on hard reasoning and instruction following.
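
The reusable-RM and mixed-signal points can be illustrated with a best-of-N sketch. Everything here is hypothetical: `toy_reward` is a stand-in that combines a placeholder for the learned score r_θ(x, y) with a programmatic format check, and the candidates are hard-coded rather than sampled from a model.

```python
from typing import Callable, Sequence

def best_of_n(prompt: str, candidates: Sequence[str],
              reward_fn: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest reward for this prompt."""
    return max(candidates, key=lambda y: reward_fn(prompt, y))

def toy_reward(prompt: str, completion: str) -> float:
    """Hypothetical reward: placeholder RM score plus a format check."""
    rm_score = -abs(len(completion) - 20) / 20.0   # stands in for r_theta(x, y)
    format_bonus = 1.0 if completion.endswith(".") else 0.0
    return rm_score + format_bonus

prompt = "What is the capital of France?"
candidates = ["Paris", "The capital is Paris.", "It is Paris, I think"]
best = best_of_n(prompt, candidates, toy_reward)
```

The same `reward_fn` could score rollouts during RL training, which is the sense in which a reward function is reusable across policies and inference-time selection.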

In practice: start with DPO or its variants (IPO, KTO, SimPO) for cost and stability. Move to PPO-style RL when you need on-policy correction, mixed reward sources, or the last few points of quality.