Offline RL and the DPO Connection

DPO re-derives the standard KL-regularised RLHF objective and solves it in closed form, turning preference alignment into a supervised classification loss over offline data without ever sampling from the policy during training.

The 2023 DPO paper opens with a quietly devastating observation: every RLHF practitioner was already solving a constrained optimisation problem that has a known closed-form solution. They just hadn't realised it, so they were using PPO instead.

That mismatch explains both the elegance of DPO and its limitations. Understanding the gap between the two approaches requires knowing exactly where offline RL ends and where online RL begins.
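To make the offline side concrete, here is a minimal sketch of the per-pair DPO loss as it is usually written, using only the Python standard library. The function names (`log_sigmoid`, `dpo_pair_loss`), the argument names, and the default `beta=0.1` are illustrative assumptions, not taken from this article. The point to notice is that the inputs are log-probabilities of responses that already exist in a fixed preference dataset, so nothing is sampled from the policy.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_pair_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; beta is the KL-regularisation strength
) -> float:
    """DPO loss for one (chosen, rejected) pair from an offline dataset.

    Each argument is a summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference
    # on each response, measured in log-probability.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Binary classification: the chosen response should score higher.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy is identical to the reference, both log-ratios are zero and the loss equals log 2 (about 0.693), the value for a classifier with no preference either way. Raising the policy's log-probability on the chosen response relative to the rejected one pushes the loss below that.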

The objective everyone is actually solving
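As a reference point, the KL-regularised objective and its closed-form solution can be written as follows. This is a sketch in the notation commonly used for DPO, and the symbols are assumptions introduced here rather than taken from this page: $\pi$ is the policy, $\pi_{\mathrm{ref}}$ the frozen reference model, $r$ the reward, $\beta$ the regularisation strength, and $Z(x)$ a per-prompt normaliser.

```latex
% KL-regularised RLHF objective
\max_{\pi} \;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
    \bigl[ r(x, y) \bigr]
  \;-\; \beta \,
  \mathbb{E}_{x \sim \mathcal{D}}
    \Bigl[ \mathrm{KL}\bigl( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr) \Bigr]

% Closed-form maximiser
\pi^{*}(y \mid x)
  = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \,
    \exp\!\Bigl( \tfrac{1}{\beta} \, r(x, y) \Bigr),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \,
       \exp\!\Bigl( \tfrac{1}{\beta} \, r(x, y) \Bigr)

% Solving for the reward in terms of the optimal policy
r(x, y)
  = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
    + \beta \log Z(x)
```

The last line is what makes a supervised loss possible: $Z(x)$ depends only on the prompt, so it cancels whenever two responses to the same prompt are compared.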
