DPO in Practice
DPO eliminates the separate reward model and RL loop of classic RLHF by reparameterising the reward directly into a classification loss over preferred and rejected response pairs.
The problem with PPO-based RLHF is not theoretical: it is operational. You need to train a reward model, keep a frozen reference policy in GPU memory alongside the live policy, sample from the policy during training, run a KL-penalised RL update, and tune at least four hyperparameters that interact badly. For a 70B model this translates to weeks of engineering before you see a single useful gradient. DPO (Direct Preference Optimisation, Rafailov et al. 2023) collapses that entire pipeline into a single binary cross-entropy pass over preference pairs.
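To make the "single binary cross-entropy pass" concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the summed log-probabilities of each response under the live policy and the frozen reference have already been computed. The function names, argument names, and the default beta = 0.1 are illustrative choices, not taken from the article. The reference log-probabilities are still required, but they can be computed once, ahead of training.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, an assumption of this sketch
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each *_logp argument is the sum of token log-probabilities of a full
    response given the prompt, under the policy or the reference model.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # Binary cross-entropy with the label "chosen beats rejected":
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    margin = chosen_reward - rejected_reward
    return softplus(-margin)


def dpo_batch_loss(pairs: list[tuple[float, float, float, float]], beta: float = 0.1) -> float:
    """Mean DPO loss over a batch of
    (policy_chosen, policy_rejected, ref_chosen, ref_rejected) tuples."""
    return sum(dpo_pair_loss(*p, beta=beta) for p in pairs) / len(pairs)
```

When the policy equals the reference, the margin is zero and the loss is log 2, the value of a coin-flip classifier. The loss falls below that only as the policy raises the chosen response's log-ratio relative to the rejected one.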
The Maths in One Screen
Standard RLHF maximises a KL-penalised reward objective:
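The preview stops before the formula, so what follows is a reconstruction in the notation of Rafailov et al. (2023), not the article's own typesetting. The symbols are assumptions of this sketch: \pi_\theta is the live policy, \pi_{\mathrm{ref}} the frozen reference, r_\phi the learned reward model, \beta the KL coefficient, and (x, y_w, y_l) a prompt with its preferred and rejected responses. The first line is the KL-penalised objective. The second is the classification loss that DPO obtains by reparameterising the reward in terms of the policy.

```latex
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \bigl[ r_\phi(x, y) \bigr]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\bigl[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \bigr]

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```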