Online vs Offline Preference Optimisation

Training on yesterday's model outputs to steer today's model looks self-defeating, yet that is precisely what most practitioners do when they run DPO on a static preference dataset. The resulting distribution mismatch is not a minor nuisance; Tang et al. (2024) showed experimentally that the performance gap between online and offline alignment methods persists even when model scale is increased, and that offline-trained policies actually become better at pairwise classification while degrading at generation - a pattern that, in their experiments, offline data coverage and quality alone could not explain.

The Core Distinction

Offline preference optimisation collects a dataset of (prompt, chosen, rejected) triples once, typically from a fixed snapshot of some SFT model, and then trains the policy against those static labels. DPO (Rafailov et al., 2023) is the canonical example. The Bradley-Terry loss it optimises is:

L_DPO(θ) = -E_{(x, y_w, y_l) ~ D} [
    log σ( β · log π_θ(y_w|x)/π_ref(y_w|x)
           - β · log π_θ(y_l|x)/π_ref(y_l|x) )
]

Here D is fixed at collection time. As π_θ moves away from the SFT policy during training, the ratio π_θ(y|x)/π_ref(y|x) drifts, yet neither the responses in D nor their labels update to reflect what the current model actually generates. You are fitting a moving target with a stationary rubber band.
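To make the loss concrete, here is a minimal sketch in plain Python. It assumes you already have sequence-level log-probabilities for each response under the policy and the reference model; the function names, the tuple layout, and the default β = 0.1 are illustrative choices, not taken from any particular library.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(batch, beta: float = 0.1) -> float:
    """Mean DPO loss over a batch of preference pairs.

    Each element of `batch` is a tuple of four sequence-level log-probs
    (hypothetical layout):
    (logp_policy_chosen, logp_policy_rejected, logp_ref_chosen, logp_ref_rejected)
    """
    total = 0.0
    for pol_w, pol_l, ref_w, ref_l in batch:
        # beta * [log pi(y_w)/pi_ref(y_w) - log pi(y_l)/pi_ref(y_l)]
        margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
        total -= log_sigmoid(margin)
    return total / len(batch)

# When the policy equals the reference, every margin is 0 and the loss
# is log(2), whatever the data looks like.
batch_at_init = [(-12.0, -15.0, -12.0, -15.0), (-8.5, -7.9, -8.5, -7.9)]
loss_at_init = dpo_loss(batch_at_init)

# Raising the policy's log-prob of the chosen responses lowers the loss.
batch_after = [(-11.0, -15.0, -12.0, -15.0), (-7.5, -7.9, -8.5, -7.9)]
loss_after = dpo_loss(batch_after)
```

Note that nothing in `dpo_loss` asks where the pairs came from: the same static tuples are valid inputs at step 0 and at step 10,000, which is exactly why the mismatch is easy to overlook.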

Online preference optimisation instead samples two candidate responses from π_θ at each training step, obtains a preference signal (from a reward model, an LLM judge, or humans), and updates the policy immediately on that fresh data. RLHF with PPO is the oldest example; online DPO variants (Guo et al., 2024; Xiong et al., 2023) decouple this from the PPO machinery while preserving on-policy sampling.
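The loop described above can be sketched on a toy problem. This is not the algorithm from any of the cited papers: it assumes a policy that is a softmax over three canned responses, and a hypothetical scalar `REWARD` table standing in for the reward model or judge. The point is only to show the order of operations: sample from the current policy, label, update.

```python
import math
import random

# Hypothetical response set and judge scores, for illustration only.
RESPONSES = ["concise", "verbose", "off_topic"]
REWARD = {"concise": 1.0, "verbose": 0.2, "off_topic": -1.0}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def online_dpo_step(logits, ref_logits, rng, beta=0.5, lr=0.5):
    """Sample a pair from the *current* policy, label it, take one DPO step."""
    probs = softmax(logits)
    i, j = rng.choices(range(len(logits)), weights=probs, k=2)
    if REWARD[RESPONSES[i]] == REWARD[RESPONSES[j]]:
        return logits  # same response drawn twice: no preference signal
    w, l = (i, j) if REWARD[RESPONSES[i]] > REWARD[RESPONSES[j]] else (j, i)

    ref_probs = softmax(ref_logits)
    h = beta * ((math.log(probs[w]) - math.log(ref_probs[w]))
                - (math.log(probs[l]) - math.log(ref_probs[l])))
    # Gradient of -log(sigmoid(h)) w.r.t. the logits of a softmax policy:
    # -beta*sigmoid(-h) on the winner, +beta*sigmoid(-h) on the loser.
    weight = beta / (1.0 + math.exp(h))
    new_logits = list(logits)
    new_logits[w] += lr * weight
    new_logits[l] -= lr * weight
    return new_logits

rng = random.Random(0)
ref_logits = [0.0, 0.0, 0.0]
logits = list(ref_logits)
for _ in range(200):
    logits = online_dpo_step(logits, ref_logits, rng)
final_probs = softmax(logits)
```

Because each pair is drawn from `softmax(logits)` at the moment of the update, the comparisons the policy is trained on always involve responses it currently produces.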

The practical taxonomy looks like this:

Regime	Data source	Policy at data collection	Label freshness
Offline	Static dataset	SFT model (frozen)	Stale
Iterative / hybrid	Re-sampled periodically	Current policy, batched	Periodically fresh
Online	Sampled every step	Current policy (live)	Always fresh

Iterative methods - re-collecting preference data every N steps - sit between the two extremes and often represent the best practical tradeoff.
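One way to read the table is to ask, at any training step, which version of the policy produced the data being trained on. The sketch below encodes that; the function names and the `refresh_every` parameter are hypothetical, and the SFT model is treated as the policy at step 0.

```python
def sampling_policy_step(regime: str, step: int, refresh_every: int = 100) -> int:
    """Training step of the policy snapshot whose samples are used at `step`."""
    if regime == "offline":
        return 0                              # always the frozen SFT model
    if regime == "iterative":
        return step - step % refresh_every    # most recent re-collection point
    if regime == "online":
        return step                           # the live policy
    raise ValueError(f"unknown regime: {regime!r}")

def staleness(regime: str, step: int, refresh_every: int = 100) -> int:
    """Number of updates between data collection and the current step."""
    return step - sampling_policy_step(regime, step, refresh_every)

lag_at_250 = {r: staleness(r, 250) for r in ("offline", "iterative", "online")}
```

Under these assumptions, offline staleness grows without bound, iterative staleness is capped at `refresh_every - 1`, and online staleness is always zero. Choosing N is then a trade between sampling cost and how much lag you tolerate.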

Why Distribution Shift Matters

DPO's loss implicitly assumes that the training distribution is close to the current policy's distribution. When the policy has moved, the implicit reward signal the loss assigns to unseen (prompt, response) pairs can be wildly miscalibrated.

Concretely: suppose the SFT model occasionally produces a verbose, rambling answer. The preference data marks that style as "rejected." After several DPO gradient steps, the model has learned to suppress verbosity, so it almost never produces verbose outputs. But the rejected samples still appear in every batch, providing a gradient signal on a response type the model no longer generates - signal that is now noise relative to the actual current policy's failure modes.
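The verbosity scenario can be reproduced in a deliberately tiny setting. The sketch assumes a softmax policy over three canned responses and a single static pair (concise preferred over verbose); the third response, `off_topic`, is a hypothetical failure mode that the dataset never mentions. All hyperparameters are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# index 0 = concise (chosen), 1 = verbose (rejected), 2 = off_topic (never labelled)
ref_logits = [0.0, 0.0, 0.0]
logits = list(ref_logits)
beta, lr = 0.5, 1.0

history = []   # policy probabilities after each update
weights = []   # size of the DPO gradient weight at each update
for step in range(200):
    # For a softmax policy the normalisers cancel, so the DPO margin
    # depends only on the two logits named in the static pair.
    h = beta * ((logits[0] - ref_logits[0]) - (logits[1] - ref_logits[1]))
    weight = beta / (1.0 + math.exp(h))   # beta * sigmoid(-h)
    logits[0] += lr * weight              # push the chosen response up
    logits[1] -= lr * weight              # push the rejected response down
    history.append(softmax(logits))
    weights.append(weight)

p_concise, p_verbose, p_off_topic = history[-1]
```

In this toy run the probability of the verbose response falls far below its starting value of one third, yet the same pair is still consumed at every step. The `off_topic` logit is never touched, and by the end `off_topic` is more likely than the verbose response the data keeps penalising. A real language model shares parameters across responses, so the effect will not be this clean, but the mechanism is the same.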
