RLHF vs DPO in production: what we learned shipping both
May 30, 2026 · 9 min read
DPO is the right default for almost every preference-tuning project we have shipped in the last eighteen months, and the teams still defaulting to PPO are mostly paying for habit rather than results. That is the short version. The long version is that DPO's "just a classification loss" pitch hides a different set of failure modes - length inflation, distribution drift away from the SFT reference, brittle behaviour when beta is set too low - that you only see when you have actually been on call for one of these runs. This essay is the trade-off matrix we wish we had before our first DPO run went sideways at hour fourteen.
The pitch and the reality
The classical RLHF stack from Ouyang et al. and the Anthropic HH paper is three stages: SFT, train a reward model on pairwise preferences, then run PPO (Schulman et al., 2017) against the frozen reward model with a KL penalty against the SFT reference. Four models in memory during training (policy, reference, reward, value), a sampling loop inside the training loop, and a hyperparameter surface that has eaten more weekends than we want to count.
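To make the "KL penalty against the SFT reference" concrete, here is a minimal sketch of how the per-token reward for one sampled response is often assembled in PPO-RLHF code. The function name, argument names and the default coefficient are our own illustrative assumptions, not taken from any particular codebase.

```python
from typing import List, Sequence


def kl_penalised_rewards(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    reward_model_score: float,
    kl_coef: float = 0.1,  # placeholder value, not a recommendation
) -> List[float]:
    """Per-token rewards for one sampled response (hypothetical helper)."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need one reference logprob per policy logprob")
    if not policy_logprobs:
        raise ValueError("response must contain at least one token")
    # Every token pays a penalty proportional to log pi(token) - log pi_ref(token),
    # a sample-based estimate of the KL divergence from the reference.
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    # The scalar reward-model score is credited to the final token only.
    rewards[-1] += reward_model_score
    return rewards
```

The value head then has to learn to spread that end-of-sequence score back over earlier tokens, which is part of why its initialisation matters.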
DPO (Rafailov et al., 2023) collapses this. The derivation shows that the optimal policy under the KL-constrained RLHF objective has a closed form in terms of the reference policy and an implicit reward; you can rearrange to a simple binary cross-entropy loss over preference pairs, with the reference model providing the log-ratio normaliser. No reward model. No PPO. Two model copies (policy and frozen reference), one forward-backward pass per batch. The reality is that this works - often better than PPO at modest scale - but it works for reasons that are not exactly the reasons the paper sells.
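For a sense of how little code that is, here is a sketch of the per-pair loss in plain Python. It assumes you already have summed sequence log-probabilities from the policy and the frozen reference; the function and argument names are ours, and a real trainer would do this on batched tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # placeholder value
) -> float:
    """DPO loss for a single preference pair (illustrative sketch)."""
    # The reference model only enters through these two log-ratios.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Binary cross-entropy on "chosen beats rejected".
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))
```

At initialisation the policy equals the reference, both log-ratios are zero, and the loss is log 2 for every pair, which is a handy sanity check on a new run.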
What DPO actually buys you
Three things, in our experience.
The first is operational simplicity that compounds. The PPO loop has a value head whose initialisation matters, a GAE lambda you have to tune, a clip range, a target KL, an adaptive KL coefficient if you use one, a rollout buffer size, a number of PPO epochs per rollout, and a learning rate that has to play nicely with all of them. DPO has a learning rate, a beta, and a batch size. We have onboarded engineers to a DPO codebase in two days who would have needed two weeks to safely babysit PPO.
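One way to see the difference in tuning surface is to write both configs down. The sketch below is illustrative only: the field names are hypothetical (they mirror the knobs listed above, not any library's config schema) and every default is a placeholder rather than a recommendation.

```python
from dataclasses import dataclass, fields


@dataclass
class PPOKnobs:
    # Hypothetical names and placeholder values throughout.
    learning_rate: float = 1e-6
    value_head_init_std: float = 0.02
    gae_lambda: float = 0.95
    clip_range: float = 0.2
    target_kl: float = 6.0
    adaptive_kl_init_coef: float = 0.2
    rollout_buffer_size: int = 512
    ppo_epochs_per_rollout: int = 4


@dataclass
class DPOKnobs:
    # Hypothetical names and placeholder values throughout.
    learning_rate: float = 5e-7
    beta: float = 0.1
    batch_size: int = 64


def tuning_surface(config) -> int:
    """Number of knobs a config exposes (accepts a dataclass or an instance)."""
    return len(fields(config))
```

The count understates the gap, because the PPO knobs interact with each other while the DPO ones are largely independent.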
The second is that reward hacking by proxy goes away, because there is no proxy. A trained reward model is an over-fitted approximation of human preference, and PPO is extraordinarily good at finding the seams in it. DPO trains against the preferences directly, so the model can only "hack" the preference distribution itself - which it does, mostly via length, as we will get to.
The third is reproducibility. PPO runs with the same seed and the same data still drift apart, because each update changes the rollouts that the next update trains on. DPO runs on a fixed dataset are close to deterministic given seed and data. This matters more for debugging than the paper makes clear: when something regresses, you can actually bisect.
What DPO does not buy you
DPO is supposed to be the safe choice. It is not, quite.
The first thing that bites everyone is length bias. DPO will, almost without exception, push your model toward longer responses than the SFT reference, because longer responses dominate the chosen side of most public preference datasets. We saw mean response length grow by 30-60% across runs on UltraFeedback-style data, with no improvement on the underlying task. SimPO (Meng, Xia and Chen, 2024) was framed partly as a fix for this; it normalises the log-probability by sequence length and drops the reference model. It helps, but it does not eliminate the issue, because the underlying preference data still rewards verbosity.
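Measuring the drift is cheap. A minimal sketch, assuming you have token counts for responses from the SFT reference and from the tuned policy on the same prompts (the helper name is ours):

```python
from statistics import mean
from typing import Sequence


def length_growth(ref_lengths: Sequence[int], policy_lengths: Sequence[int]) -> float:
    """Relative change in mean response length; 0.4 means 40% longer."""
    if not ref_lengths or not policy_lengths:
        raise ValueError("both length lists must be non-empty")
    ref_mean = mean(ref_lengths)
    if ref_mean == 0:
        raise ValueError("reference responses have zero mean length")
    return mean(policy_lengths) / ref_mean - 1.0
```

Logging this number per checkpoint next to the task metric makes it obvious when length is moving and quality is not.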
The second is that the reference model is doing more work than it looks. Push beta too low and the policy drifts arbitrarily far from the reference; the chosen and rejected logprobs both crash, and the model unlearns the formatting, refusal behaviour, and tool-use patterns you spent SFT installing. Push beta too high and you get nothing. The usable window for beta on a 7B model on instruction data has been roughly 0.05-0.3 for us, with the sweet spot data-dependent. PPO's KL penalty is more interpretable in this regard: you set a target KL and the adaptive coefficient holds it.
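For readers who have not seen one, here is a sketch of the kind of proportional controller commonly used to hold a target KL in PPO-RLHF code. The class name, the clipping bounds of 0.2 and the horizon parameter are assumptions for illustration; check your own trainer for its exact rule.

```python
class AdaptiveKLController:
    """Nudges the KL coefficient so the observed KL tracks a target (sketch)."""

    def __init__(self, init_kl_coef: float, target_kl: float, horizon: int) -> None:
        if target_kl <= 0 or horizon <= 0:
            raise ValueError("target_kl and horizon must be positive")
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Relative error against the target, clipped so one noisy batch
        # cannot swing the coefficient too far.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        # Raise the coefficient when KL is above target, lower it when below.
        self.kl_coef *= 1.0 + error * n_steps / self.horizon
        return self.kl_coef
```

DPO has no equivalent feedback loop: beta is fixed for the run, so the only way to find out it was too low is to watch the logprobs and the evals.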
The third is that on-policy data matters more than DPO admits. DPO learns from a fixed dataset of preference pairs that were generated by some other policy. The further your current policy drifts from the policy that generated the data, the less informative each pair becomes. PPO regenerates rollouts continuously, so it is always on-policy. The Tülu 3 work from Ai2 (Lambert et al., 2024) leaned into iterative DPO with on-policy generations precisely because off-the-shelf DPO on a static dataset plateaued well below what PPO-style approaches achieved.
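The on-policy refresh step can be sketched in a few lines. This is a simplified illustration, not the Tülu 3 recipe: `sample_fn` and `score_fn` are hypothetical callables standing in for "generate from the current policy" and "score with whatever judge or reward signal you trust".

```python
from typing import Callable, Dict, List, Sequence


def build_on_policy_pairs(
    prompts: Sequence[str],
    sample_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    n_samples: int = 4,
) -> List[Dict[str, str]]:
    """Sample from the current policy and keep a best-vs-worst pair per prompt."""
    pairs = []
    for prompt in prompts:
        candidates = [sample_fn(prompt) for _ in range(n_samples)]
        scored = [(score_fn(prompt, c), c) for c in candidates]
        best = max(scored, key=lambda sc: sc[0])
        worst = min(scored, key=lambda sc: sc[0])
        # Skip prompts where every sample scored the same: no preference signal.
        if best[0] > worst[0]:
            pairs.append({"prompt": prompt, "chosen": best[1], "rejected": worst[1]})
    return pairs
```

Rerunning this between DPO rounds keeps the pairs close to what the current policy actually produces, at the cost of a generation and scoring pass per round.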
Production note: the single highest-leverage operational change we made on DPO runs was not tuning beta. It was running a length-controlled eval at every checkpoint. Without it, you ship a model that "wins" on AlpacaEval and loses on every real user task because it is now 70% longer for no reason. If your eval harness does not normalise by length, your DPO runs are lying to you.
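A crude version of such a check, as a sketch: compute the win rate only over comparisons where the candidate and baseline responses are of similar length. This is a simple proxy of our own, not the regression-based adjustment that length-controlled leaderboards use, and the 10% tolerance is an arbitrary placeholder.

```python
from typing import Iterable, Optional, Tuple


def length_matched_win_rate(
    results: Iterable[Tuple[bool, int, int]],
    tolerance: float = 0.1,
) -> Optional[float]:
    """Win rate over pairs whose lengths differ by at most `tolerance`.

    Each result is (candidate_won, candidate_len, baseline_len).
    Returns None when no comparison falls inside the tolerance.
    """
    kept = [
        won
        for won, cand_len, base_len in results
        if base_len > 0 and abs(cand_len - base_len) / base_len <= tolerance
    ]
    if not kept:
        return None
    return sum(kept) / len(kept)
```

If the raw win rate is high and the length-matched one is near 50%, the checkpoint is winning on verbosity rather than quality.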
The variants worth knowing
DPO is now a family. Three variants come up in serious conversations, and each fixes a specific failure of vanilla DPO.
KTO (Ethayarajh et al., 2024) drops the requirement for paired preferences entirely. You give it desirable or undesirable single examples and it optimises a prospect-theory utility. This is enormous if your data is thumbs-up / thumbs-down telemetry from production - which describes most teams with a deployed product. Pair construction is a real expense and a real source of bias; KTO sidesteps both.
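A sketch of the per-example KTO loss as we read it from the paper. The reference point, which the paper estimates from a batch-level KL term, is passed in here as a plain number, and the names and default weights are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_loss(
    policy_logp: float,
    ref_logp: float,
    desirable: bool,
    kl_reference_point: float = 0.0,  # assumed to be estimated elsewhere
    beta: float = 0.1,
    lambda_d: float = 1.0,  # weight on desirable examples
    lambda_u: float = 1.0,  # weight on undesirable examples
) -> float:
    """KTO loss for one unpaired example (illustrative sketch)."""
    log_ratio = policy_logp - ref_logp
    if desirable:
        # Loss falls as the policy raises this response above the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - kl_reference_point)))
    # Loss falls as the policy pushes this response below the reference point.
    return lambda_u * (1.0 - sigmoid(beta * (kl_reference_point - log_ratio)))
```

Note that each example is scored on its own: nothing in the loss needs a matching rejected or chosen partner, which is what lets it consume thumbs telemetry directly.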
IPO, introduced as a special case of ΨPO by Azar et al. (2023), replaces DPO's log-sigmoid loss with a squared-error formulation that does not over-fit to deterministic preferences. In practice it is more robust when your preference labels are noisy - which they always are - but slightly less sharp when they are clean.
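The squared-error form is easy to see in code. A minimal sketch on summed sequence log-probabilities, with names and the default tau chosen for illustration (some library implementations average over tokens instead of summing):

```python
def ipo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    tau: float = 0.1,  # placeholder value
) -> float:
    """IPO loss for a single preference pair (illustrative sketch)."""
    # Same log-ratio difference that DPO feeds into its log-sigmoid.
    h = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    # Regress the difference toward a finite target instead of pushing it to infinity.
    return (h - 1.0 / (2.0 * tau)) ** 2
```

Because the target margin is finite, overshooting it is penalised just like undershooting, which is what stops the loss from chasing labels that are deterministic or noisy.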
SimPO (Meng, Xia and Chen, 2024) drops the reference model, uses length-normalised log-probabilities as the implicit reward, and adds a target reward margin. Halves memory during training and addresses length bias directly. The catch is that without a reference, you have nothing keeping the policy near SFT behaviour, so you need stronger SFT and more careful early stopping.
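A sketch of the SimPO per-pair loss, assuming summed sequence log-probabilities and token counts from the policy only. The names are ours and the default beta and gamma are placeholders rather than tuned values.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def simpo_loss(
    chosen_logp: float,
    chosen_len: int,
    rejected_logp: float,
    rejected_len: int,
    beta: float = 2.0,
    gamma: float = 1.0,
) -> float:
    """SimPO loss for a single preference pair (illustrative sketch)."""
    if chosen_len <= 0 or rejected_len <= 0:
        raise ValueError("sequence lengths must be positive")
    # Implicit reward is the average per-token log-probability, scaled by beta.
    chosen_reward = beta * chosen_logp / chosen_len
    rejected_reward = beta * rejected_logp / rejected_len
    # The chosen response has to win by at least the target margin gamma.
    return -log_sigmoid(chosen_reward - rejected_reward - gamma)
```

No reference log-probabilities appear anywhere, which is where both the memory saving and the missing anchor come from.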
PPO is still here. It is not deprecated; it is selectively useful, in the cases we name below.
The comparison table you actually want
| Method | Pipeline complexity | Compute cost (vs SFT) | Hyperparameter sensitivity | Mode collapse risk | Length bias | Preference data efficiency | Default at frontier labs |
|---|---|---|---|---|---|---|---|
| PPO-RLHF | High - 4 models, rollout loop, value head | 4-8x | High - KL target, clip, GAE, value LR | Medium (KL penalty helps) | Medium (reward-model dependent) | High (on-policy regen) | Used selectively |
| DPO | Low - 2 models, one loss | 1.5-2x | Medium - beta is the lever | Medium-low | High (the failure mode) | Medium (off-policy degrades) | Common default |
| KTO | Low - 2 models, unpaired data | 1.5-2x | Medium - desirable / undesirable weights | Low | Medium | Very high (uses thumbs data directly) | Where data is unpaired |
| IPO | Low | 1.5-2x | Low - robust to label noise | Low | Medium | Medium | Niche - noisy labels |
| SimPO | Lowest - 1 model, no reference | 1.0-1.3x | Medium - margin gamma matters | Medium-high (no anchor) | Low (length-normalised) | Medium | Cost-sensitive runs |
The "default at frontier labs" column is the noisiest one and worth caveating: labs do not publish their full recipes, and what leaks suggests most production systems are now hybrids - iterative DPO with on-policy generation, sometimes followed by a short PPO phase on verifiable rewards, sometimes with KTO on thumbs data layered in. Pure single-algorithm pipelines are a 2023 phenomenon.
When PPO still wins
We have shipped PPO twice in the last year. Both times the call was easy.
- Verifiable rewards. If your reward signal is a real function (unit tests pass, the math answer is correct, the SQL query returns the right row), there is no reward model to over-fit and the gradient information per rollout is dense. Ai2's RLVR work and DeepSeek-R1's success on math / code are the public face of this. DPO cannot use this signal directly without manufacturing pairs, and the manufactured pairs throw away most of the information.
- You need behaviour that does not exist in your SFT distribution. DPO is bounded by what the reference can do; it sharpens, it does not explore. If you want the model to learn a tool-use pattern, a reasoning trace style, or a refusal that the SFT model never produces, PPO's sampling loop can find it. DPO cannot upweight a behaviour that never appears in its preference pairs.
- Long-horizon credit assignment. Multi-turn agent traces where the reward comes at the end of a 20-step interaction are PPO territory. DPO's per-pair loss has no story for assigning credit across turns.
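To illustrate the first case, here is a sketch of a verifiable reward and of what happens when you squeeze it into preference pairs. Both helpers are hypothetical; exact string match stands in for whatever checker (unit tests, SQL result comparison) your task allows.

```python
from typing import List, Optional, Sequence, Tuple


def exact_match_reward(model_answer: str, reference_answer: str) -> float:
    """1.0 if the answer matches the reference after trimming whitespace, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def pair_from_rollouts(
    rollouts: Sequence[str], rewards: Sequence[float]
) -> Optional[Tuple[str, str]]:
    """Collapse scored rollouts into at most one (chosen, rejected) pair."""
    if len(rollouts) != len(rewards):
        raise ValueError("need one reward per rollout")
    if not rollouts:
        return None
    best = max(range(len(rollouts)), key=lambda i: rewards[i])
    worst = min(range(len(rollouts)), key=lambda i: rewards[i])
    # If every rollout scored the same there is no preference to express.
    if rewards[best] == rewards[worst]:
        return None
    return rollouts[best], rollouts[worst]
```

An RL update can use every scored rollout; the pair builder keeps two of them and discards the rest, along with any prompt where all samples passed or all failed.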
Everywhere else - the bulk of instruction-tuning, persona work, refusal calibration, format adherence - DPO and its variants dominate on the cost / result curve.
The DPO loss, derived (skip if you have seen it)
Start from the KL-constrained RLHF objective: maximise the expected reward of the policy minus beta times the KL divergence from the reference policy. This has a closed-form solution where the optimal policy is proportional to the reference policy times exp(reward / beta). Rearrange for the implicit reward and plug into a Bradley-Terry preference model where the probability that the chosen response beats the rejected response is sigmoid of the reward difference. The partition function cancels in the difference. The negative log-likelihood becomes:
$$
\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$
The policy is implicitly its own reward model. That is the whole trick - and one source of length bias, because the summed log-ratio grows with sequence length.
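For readers who prefer symbols, the intermediate steps described in words above can be written out as follows. This is the standard derivation in our notation, with $r(x, y)$ for the reward and $Z(x)$ for the partition function; both symbols are introduced here for illustration.

```latex
% KL-constrained objective
\max_{\pi_\theta} \; \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Closed-form optimum
\pi^{*}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \, \exp\!\big( r(x, y) / \beta \big)

% Rearranged for the implicit reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry: the \beta \log Z(x) terms cancel in the difference
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\left( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                 - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)
```

Replacing $\pi^{*}$ with the trainable $\pi_\theta$ and taking the negative log-likelihood over the preference data gives the loss.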
How we would choose now
A decision procedure rather than a flowchart, because most real projects have constraints that pre-empt half the tree.
- If your preference data is unpaired thumbs-up / down telemetry, start with KTO. You will avoid the synthetic pair-construction step entirely and the loss matches the data shape.
- If your preference data is paired but noisy (low inter-annotator agreement), use IPO over DPO. The squared-error loss is more robust; expect a small ceiling sacrifice for a meaningful floor improvement.
- If you are compute-constrained and care about length, use SimPO. Make sure your SFT is strong, because there is no reference anchor.
- For the default case - paired preferences, reasonable label quality, standard instruction-tuning - use DPO with iterative on-policy refresh. Run for 1-2 epochs, beta in [0.1, 0.3], length-controlled eval at every checkpoint, early stop on the eval not on the train loss.
- Reach for PPO when you have a verifiable reward, when you need exploration beyond the SFT distribution, or when the reward signal is long-horizon. Budget for the operational overhead honestly - 2-3x the engineer time of a DPO run, easily.
- Hybrid is the frontier-lab move. Iterative DPO to get most of the way, then a short PPO phase on the residual behaviours that matter most. We are increasingly doing this and we suspect everyone serious is.
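The procedure above can be sketched as a function. The flag names are ours, and so is the order of precedence (structural reasons for PPO first, then data shape, then label noise, then compute); treat it as one reasonable reading of the list rather than a rule, and note that it leaves out the hybrid option.

```python
def choose_method(
    *,
    verifiable_reward: bool = False,
    needs_exploration: bool = False,
    long_horizon: bool = False,
    paired: bool = True,
    noisy_labels: bool = False,
    compute_constrained: bool = False,
) -> str:
    """Map project constraints to a starting method (illustrative sketch)."""
    # The structure of the reward forces PPO regardless of data shape.
    if verifiable_reward or needs_exploration or long_horizon:
        return "PPO"
    # Unpaired thumbs-up / down telemetry matches KTO's loss directly.
    if not paired:
        return "KTO"
    # Paired but low inter-annotator agreement.
    if noisy_labels:
        return "IPO"
    # Tight compute budget and length matters.
    if compute_constrained:
        return "SimPO"
    return "DPO with iterative on-policy refresh"
```

For example, `choose_method(paired=False)` returns `"KTO"`, and calling it with no arguments returns the DPO default.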
The TRL library ships clean implementations of DPO, KTO, GRPO and SFT trainers, and the Hugging Face DPO write-up (Rasul, Belkada and von Werra, 2023) is still the fastest path to a working baseline. Start there, measure the length bias on day one, and only reach for PPO when the structure of your reward forces your hand. The simplicity pitch was directionally right. The fine print is where the work lives.