Applied LLMs

DPO in Practice

DPO eliminates the separate reward model and RL loop of classic RLHF by reparameterising the reward directly into a classification loss over preferred and rejected response pairs.

intermediate · 7 min read · Premium

The problem with PPO-based RLHF is not theoretical; it is operational. You need to train a reward model, keep a frozen reference policy in GPU memory alongside the live policy, sample from the policy during training, run a KL-penalised RL update, and tune at least four hyperparameters that interact badly. For a 70B model this translates to weeks of engineering before you see a single useful gradient. DPO (Direct Preference Optimisation, Rafailov et al. 2023) collapses that pipeline into a single binary cross-entropy pass over preference pairs: the reward model, the on-policy sampling, and the RL update all disappear, and only the frozen reference policy remains.
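As a minimal sketch of what that single pass computes, the per-pair loss can be written in plain Python. It assumes you already have the summed token log-probabilities of the preferred and rejected responses under both the live policy and the frozen reference. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices here, not taken from any particular library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; strength of the implicit KL penalty
) -> float:
    """DPO loss for one preference pair.

    Each argument is the log-probability of a full response (sum over its
    tokens) given the prompt, under the policy or the frozen reference.
    """
    # Implicit rewards: how far the policy has moved from the reference
    # on each response, in log-probability terms.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Binary classification logit: preferred response should score higher.
    logit = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(logit)) == softplus(-logit)
    return softplus(-logit)


if __name__ == "__main__":
    # Policy identical to the reference: the logit is 0, the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted mass towards the preferred response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy equals the reference the logit is zero and the loss is log 2, which is where training starts. In a real training loop the same arithmetic would run on batched tensors, with the reference log-probabilities computed under no-gradient mode or precomputed offline.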

The Maths in One Screen

Standard RLHF maximises a KL-penalised reward objective:
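In the notation of Rafailov et al. (2023), which is assumed here rather than taken from this page, that objective is usually written as:

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\; \beta \,
\mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
```

Here $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen reference policy, $r_\phi$ is the learned reward model, $\mathcal{D}$ is the prompt distribution, and $\beta$ controls how strongly the policy is held close to the reference.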
