Applied LLMs

Online vs Offline Preference Optimisation

Offline preference optimisation trains on a fixed dataset of ranked responses, while online methods continuously sample from the current policy, and that single difference has substantial consequences for distribution coverage, reward hacking risk, and final alignment quality.

advanced · 8 min read

Steering today's model with preference data sampled from yesterday's model creates a distribution mismatch, yet that is precisely what most practitioners do when they run DPO on a static preference dataset. The mismatch is not a minor nuisance: Tang et al. (2024) showed experimentally that the performance gap between online and offline alignment methods persists even when model scale is increased, and that offline-trained policies become better at pairwise classification while getting worse at generation, a pattern that offline data engineering alone appears unable to fully fix.

The Core Distinction

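Offline preference optimisation collects a dataset of (prompt, chosen, rejected) triples once, typically from a fixed snapshot of some SFT model, and then trains the policy against those static labels. DPO (Rafailov et al., 2023) is the canonical example: it minimises the negative log-likelihood of the observed preferences under a Bradley-Terry model, where the implicit reward of a response is its log-probability ratio between the policy and a frozen reference model, scaled by a temperature β.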
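As a rough illustration of the DPO objective from Rafailov et al. (2023), the per-example loss can be written as -log σ(β[(log π_θ(y_w|x) - log π_ref(y_w|x)) - (log π_θ(y_l|x) - log π_ref(y_l|x))]), where y_w is the chosen response and y_l the rejected one. The sketch below computes that quantity from sequence-level log-probabilities. The function names, argument names, and the default β = 0.1 are assumptions made for illustration, not part of the original text, and a real trainer would compute the log-probabilities with a model rather than take them as plain floats.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)), computed as -softplus(-z)."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default, chosen only for this sketch
) -> float:
    """Per-example DPO loss from summed token log-probabilities.

    All four inputs are log pi(y | x) for a full response; the names are
    hypothetical and exist only for this illustration.
    """
    # Implicit rewards are beta-scaled log-ratios against the frozen reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Negative log-likelihood that "chosen beats rejected" under Bradley-Terry.
    return -log_sigmoid(margin)


def dpo_batch_loss(batch, beta: float = 0.1) -> float:
    """Mean DPO loss over an iterable of 4-tuples of log-probabilities."""
    losses = [dpo_loss(*example, beta=beta) for example in batch]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero, so the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Raising the chosen response's log-probability lowers the loss.
    print(dpo_loss(-4.0, -7.0, -5.0, -7.0))
```

Nothing in this loss depends on where the (chosen, rejected) pair came from. In the offline setting the four log-probabilities are evaluated on responses drawn once from a fixed snapshot, so the same function would apply unchanged if the pairs were instead resampled from the current policy.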
