IPO and the Overfitting Fix
IPO replaces DPO's log-sigmoid loss with a squared loss on an identity transform, removing the overfitting failure mode that appears when preference data is finite and deterministic.
DPO reduced RLHF to a binary classification problem, shipped in a weekend, and outperformed PPO on several benchmarks. The community celebrated. Then Azar et al. (2023) pointed out that DPO's core loss has a theoretical failure mode: given enough gradient steps on finite, deterministic preference data, it drives the policy towards assigning zero probability to every rejected completion, no matter how strongly the KL term is supposed to anchor it to the reference model. This is not a hyperparameter problem. It is a structural consequence of the sigmoid transform.
IPO (Identity Preference Optimisation) is the fix. It replaces one line in the DPO loss, costs nothing extra at inference, and comes with a formal performance guarantee that DPO lacks. Understanding why this replacement is necessary requires tracing back through a chain of approximations most practitioners skip.
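The "one line" difference can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's or any library's implementation: the function names, the default values of `beta` and `tau`, and the scalar inputs are all assumptions made for the example. The IPO target of `1 / (2 * tau)` follows the squared-loss form given by Azar et al. (2023); both losses act on the same quantity `h`, the policy-versus-reference log-ratio margin between the chosen and rejected completions.

```python
import math


def log_ratio_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected):
    """h = [log pi(y_w) - log pi_ref(y_w)] - [log pi(y_l) - log pi_ref(y_l)].

    All four arguments are sequence log-probabilities (hypothetical inputs).
    """
    return (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)


def dpo_loss(h, beta=0.1):
    """DPO: -log(sigmoid(beta * h)), written as a numerically stable softplus."""
    x = -beta * h
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ipo_loss(h, tau=0.1):
    """IPO: squared distance between h and the finite target 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    # Sweep the margin: the DPO loss keeps shrinking as h grows,
    # while the IPO loss bottoms out at h = 1 / (2 * tau) and then rises.
    for h in (0.0, 2.5, 5.0, 10.0, 50.0):
        print(f"h={h:5.1f}  dpo={dpo_loss(h):.6f}  ipo={ipo_loss(h):.2f}")
```

Under these assumptions the DPO loss is strictly decreasing in `h`, so gradient descent always has a reason to widen the margin further, which is what pushes the rejected completion's probability towards zero. The IPO loss instead penalises margins that overshoot its target, so the optimum stays at a finite distance from the reference model.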
The two approximations buried inside DPO