DPO and Preference Optimisation

Classic RLHF aligns a model in three moving parts: supervised fine-tuning, a separately trained reward model, and a reinforcement-learning loop (PPO) that optimises the policy against that reward while a KL penalty keeps it from drifting (see rlhf). It works, and it is finicky: an RL loop that samples during training, a reward model that can be gamed, and a stack of hyperparameters that fail in subtle ways. Direct Preference Optimisation (DPO) asked whether all that apparatus was necessary to learn from the same preference data. The answer, surprisingly, was no.

The key insight

The DPO paper's subtitle says it: "Your Language Model Is Secretly a Reward Model." The RLHF objective (maximise reward under a KL constraint to a reference policy) has a known closed-form optimal policy: the reference policy reweighted by the exponentiated reward. DPO inverts that relationship. If the optimal policy is a function of the reward, then the reward is a function of the policy: up to a term that depends only on the prompt, it is the scaled log-ratio between the trained policy and the reference model. Substitute that expression back into the reward model's own training loss, which depends only on the difference in reward between two responses to the same prompt, and the prompt-only term cancels, so no explicit reward is left to fit. What remains is a simple classification loss over preference pairs that you optimise directly on the language model.
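
The chain of substitutions can be written out as a short derivation. This is a sketch in the usual notation, which this article does not itself define: x is a prompt, y a response, r the reward, \pi_{\mathrm{ref}} the frozen reference policy, \beta the strength of the KL constraint, Z(x) the normalising constant, \sigma the logistic function, and y_w, y_l the preferred and rejected responses. It assumes preferences follow a Bradley-Terry model, as the reward model's training loss does.

```latex
\begin{align}
  % KL-constrained RLHF objective
  \max_{\pi}\;
    & \mathbb{E}_{x}\Big[\,\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
      - \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big] \\
  % closed-form optimum: reference policy reweighted by exponentiated reward
  \pi^{*}(y \mid x)
    &= \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
       \exp\!\Big(\tfrac{1}{\beta}\,r(x,y)\Big) \\
  % inverted: reward as a function of the policy
  r(x,y)
    &= \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
       + \beta \log Z(x) \\
  % Bradley-Terry preference model: only the reward difference matters,
  % so the \beta \log Z(x) term cancels
  p(y_w \succ y_l \mid x)
    &= \sigma\big(r(x,y_w) - r(x,y_l)\big) \\
  % resulting loss on the trainable policy \pi_\theta
  \mathcal{L}_{\mathrm{DPO}}(\theta)
    &= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
       \log \sigma\!\left(
         \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
         - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
       \right)\right]
\end{align}
```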

Concretely, for a prompt with a preferred response y_w and a rejected one y_l, DPO raises the model's relative log-probability of y_w over y_l, measured against the frozen reference model, and the KL constraint is baked into the objective rather than added as a separate penalty. No reward model. No sampling loop. No PPO.
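
As a minimal sketch of that loss for a single preference pair, in plain Python: the function names, the argument names, and the default beta of 0.1 are illustrative assumptions rather than anything from the paper or a particular library, and the inputs are assumed to be whole-response log-probabilities that have already been computed.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign of z so that exp() never receives a large positive value.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_logp_w: float,  # log pi_theta(y_w | x), trained policy, preferred response
    policy_logp_l: float,  # log pi_theta(y_l | x), trained policy, rejected response
    ref_logp_w: float,     # log pi_ref(y_w | x), frozen reference
    ref_logp_l: float,     # log pi_ref(y_l | x), frozen reference
    beta: float = 0.1,     # hypothetical default; sets the strength of the KL constraint
) -> float:
    """DPO loss for one (prompt, preferred, rejected) triple."""
    # Log-ratios against the reference model for each response.
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    # Scaled margin between the two log-ratios.
    margin = beta * (chosen_logratio - rejected_logratio)
    # Logistic (binary classification) loss on the margin.
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability towards the preferred response: lower loss.
    print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

When the policy matches the reference the margin is zero and the loss is log 2, the value for a coin-flip classifier. The loss falls only as the policy's log-ratio for y_w pulls ahead of its log-ratio for y_l.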

Why it caught on

Simplicity. It is supervised learning. Each pair needs forward passes through two models, the policy and the frozen reference, to score both responses, followed by a logistic loss and a standard optimiser. No RL infrastructure to babysit.
Stability. Removing the online sampling and the separate reward model removes the two components most prone to reward hacking and divergence.
Cost. No reward-model training run, and no generation during the alignment step, so it is markedly cheaper to run.
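
To make the "it is supervised learning" point concrete, the quantity each forward pass has to produce is just the log-probability of a fixed response, which is the sum of per-token log-probabilities over the response tokens. The sketch below assumes per-step logits are already available as plain lists. `sequence_logprob` and `response_mask` are hypothetical names for illustration, not a library API.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Log-probabilities from raw logits, using the max-shift trick for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_logprob(
    step_logits: Sequence[Sequence[float]],  # one vocabulary-sized logit list per position
    token_ids: Sequence[int],                # the token actually present at each position
    response_mask: Sequence[int],            # 1 for response tokens, 0 for prompt tokens
) -> float:
    """Sum of log-probabilities of the response tokens only (prompt tokens are skipped)."""
    total = 0.0
    for logits, token, keep in zip(step_logits, token_ids, response_mask):
        if keep:
            total += log_softmax(logits)[token]
    return total


if __name__ == "__main__":
    # Toy case: vocabulary of 4, uniform logits, one prompt token and two response tokens.
    uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
    print(sequence_logprob(uniform, [1, 2, 3], [0, 1, 1]))
```

Nothing is sampled here: the responses come from the dataset, so the same scoring routine runs once under the policy and once under the reference for each response in the pair.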

For these reasons DPO became a default alignment method for many open models soon after release, especially where a clean preference dataset already exists.

The family: IPO, KTO, and friends

DPO opened a design space rather than closing one. IPO (Identity Preference Optimisation) addresses a failure where DPO overfits to deterministic preferences and pushes the policy too hard; it regularises the log-ratio margin toward a fixed target instead of letting it grow without bound. KTO (Kahneman-Tversky Optimisation) drops the need for paired comparisons entirely: it learns from individual examples each labelled merely "good" or "bad", a signal that is far easier and cheaper to collect than pairwise rankings, and it shapes its loss on a prospect-theory model of how humans value outcomes. The shared thread is treating alignment as a direct loss on preference-shaped data rather than as reinforcement learning.
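
The two variants can be sketched per example in the same style. These are simplified, hedged versions: the function and parameter names and the default values are assumptions for illustration, and the KTO reference point `z0` (in the KTO paper, an estimate of the KL divergence between policy and reference) is simply passed in here rather than estimated from a batch.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ipo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             tau: float = 0.1) -> float:
    """IPO-style loss for one pair: squared distance of the margin from 1 / (2 * tau)."""
    # Same log-ratio margin as DPO, but unscaled.
    h = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    # A finite target means the margin is not pushed towards infinity.
    target = 1.0 / (2.0 * tau)
    return (h - target) ** 2


def kto_loss(policy_logp: float, ref_logp: float, desirable: bool,
             z0: float, beta: float = 0.1,
             lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Simplified KTO-style loss for one unpaired example labelled good or bad."""
    logratio = policy_logp - ref_logp
    if desirable:
        # "Good" example: value grows as the log-ratio rises above the reference point.
        return lambda_d - lambda_d * sigmoid(beta * (logratio - z0))
    # "Bad" example: value grows as the log-ratio falls below the reference point.
    return lambda_u - lambda_u * sigmoid(beta * (z0 - logratio))


if __name__ == "__main__":
    print(ipo_loss(-7.0, -14.0, -10.0, -12.0, tau=0.1))   # margin exactly at the target
    print(kto_loss(-10.0, -10.0, desirable=True, z0=0.0))  # log-ratio at the reference point
```

Note the structural difference: `ipo_loss` still consumes a pair, while `kto_loss` takes a single response and a binary label, which is what lets KTO train on unpaired feedback.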
