IPO and the Overfitting Fix

DPO reduced RLHF to a binary classification problem, shipped in a weekend, and outperformed PPO on several benchmarks. The community celebrated. Then Azar et al. (2023) pointed out that DPO's core loss has a theoretical failure mode: when the preference data is deterministic, meaning each labelled pair always favours the same completion (the usual case when a pair is annotated once), enough gradient steps will drive the policy to assign zero probability to every rejected completion, no matter how strong the KL regularisation is. This is not a hyperparameter problem. It is a structural consequence of the sigmoid transform.

IPO (Identity Preference Optimisation) is the fix. It replaces one line in the DPO loss, costs nothing extra at inference, and comes with a formal performance guarantee that DPO lacks. Understanding why this replacement is necessary requires tracing back through a chain of approximations most practitioners skip.

The two approximations buried inside DPO

DPO's elegant derivation hides two load-bearing assumptions.

Assumption 1: Pairwise preferences decompose into pointwise rewards.
DPO uses the Bradley-Terry model, which says the probability that response \(y^+\) is preferred over \(y^-\) given prompt \(x\) is:

\[P(y^+ \succ y^- \mid x) = \sigma(r(x, y^+) - r(x, y^-))\]

This is sensible for stochastic annotators. But it implies that every response has its own scalar reward \(r(x, y)\), independent of what it is being compared against. Real human preferences are often context-dependent and non-transitive; the Bradley-Terry model smooths that complexity away.
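To make the pointwise-reward assumption concrete, here is a minimal sketch of the Bradley-Terry formula above. The reward values and the names `sigmoid`, `bt_preference`, and `rewards` are invented for illustration; nothing here comes from a real reward model.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bt_preference(r_first: float, r_second: float) -> float:
    """Bradley-Terry probability that the first response is preferred."""
    return sigmoid(r_first - r_second)

# Hypothetical scalar rewards for three responses to the same prompt.
rewards = {"a": 2.0, "b": 1.0, "c": 0.0}

p_ab = bt_preference(rewards["a"], rewards["b"])
p_bc = bt_preference(rewards["b"], rewards["c"])
p_ac = bt_preference(rewards["a"], rewards["c"])

print(f"P(a > b) = {p_ab:.3f}")
print(f"P(b > c) = {p_bc:.3f}")
print(f"P(a > c) = {p_ac:.3f}")
```

Because every probability is a function of a reward difference, a beats b and b beats c together force a to beat c. A preference cycle (a over b, b over c, c over a) cannot be represented by any choice of scalar rewards, which is the sense in which the model smooths non-transitive preferences away.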

Assumption 2: The reward estimated on the training distribution generalises to the policy distribution.
DPO eliminates the reward model by re-expressing \(r\) in terms of the optimal policy and reference model. The trick only holds if the policy and reference are close enough that the reward learned from labelled pairs transfers. On small datasets, or after many gradient steps, the policy drifts far enough from the reference that the implicit reward is no longer a reliable signal.
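
For reference, the re-expression in question is the standard identity from the DPO derivation. Writing \(\pi^*\) for the optimal policy of the KL-regularised objective with strength \(\beta\) (the symbols \(\pi^*\) and \(Z\) are introduced here for illustration), the reward can be written as:

\[r(x, y) = \beta \log\frac{\pi^*(y\mid x)}{\pi_\text{ref}(y\mid x)} + \beta \log Z(x)\]

where \(Z(x)\) is a normalising term that depends only on the prompt. Substituting this into the Bradley-Terry formula, \(Z(x)\) cancels in the difference \(r(x, y^+) - r(x, y^-)\), leaving \(\beta\) times a difference of log-ratios. That difference is the only quantity the DPO loss ever sees.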

These assumptions work reasonably well in practice. The problem is what the DPO loss does near convergence.

Why DPO can overfit to deterministic preferences

The DPO objective is:

\[\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-)}\!\left[\log \sigma\!\left(\beta \left(\log\frac{\pi_\theta(y^+\mid x)}{\pi_\text{ref}(y^+\mid x)} - \log\frac{\pi_\theta(y^-\mid x)}{\pi_\text{ref}(y^-\mid x)}\right)\right)\right]\]

Define the implicit reward margin as:

\[h_\theta(x, y^+, y^-) = \log\frac{\pi_\theta(y^+\mid x)}{\pi_\text{ref}(y^+\mid x)} - \log\frac{\pi_\theta(y^-\mid x)}{\pi_\text{ref}(y^-\mid x)}\]

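For a pair that is always labelled the same way, the loss is strictly decreasing in \(h_\theta\) and approaches its infimum of zero only as \(h_\theta \to +\infty\). The gradient shrinks but never vanishes at any finite margin, so there is no finite minimiser. Because \(\pi_\theta(y^+ \mid x) \le 1\) and the reference probabilities are fixed, the margin can grow without bound only if \(\pi_\theta(y^- \mid x) \to 0\), which collapses the rejected completions entirely.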
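
The shape of the loss can be checked numerically. The sketch below evaluates the per-pair DPO loss \(-\log\sigma(\beta h)\) and its derivative with respect to the margin at a few margins; `beta = 0.1` and the chosen margins are arbitrary illustrative values, and `dpo_loss` and `dpo_grad` are names made up for this example.

```python
import math

def dpo_loss(h: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss -log(sigmoid(beta * h)), computed as softplus(-beta * h)."""
    z = -beta * h
    # softplus(z) = log(1 + exp(z)), rearranged so exp never overflows.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_grad(h: float, beta: float = 0.1) -> float:
    """Derivative of the loss w.r.t. the margin h: -beta * sigmoid(-beta * h)."""
    z = beta * h
    if z >= 0:
        e = math.exp(-z)
        return -beta * e / (1.0 + e)
    return -beta / (1.0 + math.exp(z))

margins = [0.0, 10.0, 50.0, 100.0, 500.0]
losses = [dpo_loss(h) for h in margins]
grads = [dpo_grad(h) for h in margins]

for h, loss, grad in zip(margins, losses, grads):
    print(f"h = {h:6.1f}   loss = {loss:.3e}   dloss/dh = {grad:.3e}")
```

The loss keeps falling and the derivative stays strictly negative at every margin, so gradient descent always has a reason to push the margin further.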

This matters because the training data is finite. Preference pairs are labelled examples, not samples from an infinite oracle. When you have a fixed dataset and unconstrained gradient steps, DPO will memorise the training preferences by suppressing rejected probabilities to near-zero, a pathological regime where the model has effectively assigned infinite reward to chosen completions relative to rejected ones. The KL penalty (controlled by \(\beta\)) only rescales the margin inside the sigmoid, so it slows this down but does not prevent it.
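
A toy simulation shows the drift. This is a sketch under strong simplifying assumptions, not a real training loop: one prompt, three candidate completions (chosen, rejected, other), a softmax policy over three free logits, a uniform reference policy, and a single preference pair repeated at every step. The function name `train_dpo_toy` and all hyperparameters are invented for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_dpo_toy(steps: int, beta: float = 0.1, lr: float = 1.0):
    """Gradient descent on one preference pair.

    Returns (probability of the rejected completion, margin h).
    """
    # Logits for [chosen, rejected, other]. The reference policy is uniform
    # (an assumption), so the reference log-ratio term in h is zero.
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        # Under a softmax, log pi(y+) - log pi(y-) is just the logit gap.
        h = theta[0] - theta[1]
        # dL/dh for L = -log(sigmoid(beta * h)); always negative.
        g = -beta / (1.0 + math.exp(beta * h))
        theta[0] -= lr * g   # dh/dtheta_chosen = +1, so the chosen logit rises
        theta[1] += lr * g   # dh/dtheta_rejected = -1, so the rejected logit falls
    return softmax(theta)[1], theta[0] - theta[1]

for n in (10, 100, 1000, 10000):
    p_rejected, margin = train_dpo_toy(n)
    print(f"steps = {n:6d}   h = {margin:7.2f}   pi(y-|x) = {p_rejected:.3e}")
```

In this toy setting the margin never stops growing and the rejected probability keeps shrinking with more steps. A different `beta` changes how quickly this happens, not where it ends up.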
