DPO and Preference Optimisation
How Direct Preference Optimisation collapses the reward-model-plus-PPO pipeline into a single classification loss, and where the RLHF machinery still earns its keep.
Classic RLHF aligns a model in three moving parts: supervised fine-tuning, a separately trained reward model, and a reinforcement-learning loop (PPO) that optimises the policy against that reward while a KL penalty keeps it from drifting (see rlhf). It works, and it is finicky: an RL loop that samples during training, a reward model that can be gamed, and a stack of hyperparameters that fail in subtle ways. Direct Preference Optimisation (DPO) asked whether all that apparatus was necessary to learn from the same preference data. The answer, surprisingly, was no.
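For reference, the KL-penalised objective that the PPO stage optimises is usually written as below. The symbols are notational assumptions rather than anything fixed by this article: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} the frozen reference (typically the SFT model), r_\phi the learned reward model, \mathcal{D} the prompt distribution, and \beta the strength of the KL penalty.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
  \big[ r_\phi(x, y) \big]
  \;-\;
  \beta\, \mathrm{KL}\!\big(
    \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x)
  \big)
\Big]
```

The first term pulls the policy towards high-reward responses; the second charges it for every bit of divergence from the reference, which is what keeps it from drifting.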
The key insight
The DPO paper's subtitle says it: "Your Language Model Is Secretly a Reward Model." The RLHF objective, maximise reward under a KL constraint to a reference policy, has a known closed-form optimal policy: the reference policy reweighted by the exponentiated reward. DPO inverts that relationship. If the optimal policy is a function of the reward, then the reward is a function of the policy, specifically the log-ratio between the trained policy and the reference model. Substitute that expression back into the reward model's own training loss and the explicit reward cancels out. What remains is a simple classification loss over preference pairs that you optimise directly on the language model.
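The argument above can be sketched in a few lines of code. This is a minimal illustration for a single preference pair, not a training-ready implementation: it assumes you already have the summed token log-probabilities of the chosen and rejected responses under both the policy and the frozen reference, and the function names, argument names, and the default `beta=0.1` are illustrative choices rather than anything prescribed by the article.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z)) for a scalar."""
    # log sigmoid(z) = -log(1 + exp(-z)); branch so exp() never overflows.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) preference pair.

    Sketch of the reasoning (notation is an assumption):
      optimal policy:  pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x)
      inverted:        r(x,y)   = beta * log(pi*(y|x) / pi_ref(y|x)) + beta * log Z(x)
      preference loss: -log sigmoid(r(x, chosen) - r(x, rejected))
    The beta * log Z(x) term depends only on the prompt, so it cancels
    in the difference and no explicit reward model is needed.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Binary classification loss on the reward margin.
    return -log_sigmoid(chosen_reward - rejected_reward)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability towards the chosen response: lower loss.
    print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693), the same as a classifier that has not yet learned anything. Training lowers the loss by raising the chosen response's log-ratio relative to the rejected one, with `beta` controlling how strongly deviations from the reference are scaled.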