Offline RL and the DPO Connection

The 2023 DPO paper (Rafailov et al.) rests on a quietly devastating observation: RLHF practitioners were already solving a KL-regularised optimisation problem whose optimal policy has a known closed form. The standard pipeline simply did not exploit it, and ran PPO instead.

That mismatch explains both the elegance of DPO and its limitations. Understanding both requires knowing exactly where offline RL ends and online RL begins.

The objective everyone is actually solving

Standard RLHF maximises expected reward while penalising divergence from a reference policy:

max_π  E_{x~D, y~π} [ r(x, y) ]  -  β · KL[ π(y|x) ‖ π_ref(y|x) ]

Here x is a prompt, y is a completion, r is the reward model, π_ref is the supervised fine-tuned (SFT) baseline, and β controls how far the optimised policy is allowed to drift. This is precisely the KL-regularised RL objective covered in the foundational treatment of RLHF-as-RL.

The key fact, known from the KL-constrained optimisation literature since at least Ziegler et al. (2019), is that this objective admits a closed-form optimal policy:

π*(y|x)  ∝  π_ref(y|x) · exp( r(x, y) / β )

This says: take the reference distribution, re-weight each completion by exp(r(x, y) / β), and normalise over all completions. The partition function Z(x) = Σ_y π_ref(y|x) · exp(r(x,y)/β) sums over every possible completion, which is intractable for autoregressive sequence models, and that is exactly why practitioners turned to PPO.
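
To make the closed form concrete, here is a minimal sketch on a toy problem where the completion space is small enough to enumerate. The three completions, their rewards, and the value of β are made-up numbers for illustration only; with real autoregressive models the sum inside Z(x) cannot be enumerated like this.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), plus Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # partition function Z(x) for this single prompt
    return [w / z for w in weights], z

def kl_regularised_objective(pi, pi_ref, rewards, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

# Hypothetical toy numbers: three candidate completions for one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = kl_regularised_objective(pi_star, pi_ref, rewards, beta)
baseline = kl_regularised_objective(pi_ref, pi_ref, rewards, beta)

print("pi*      :", [round(p, 4) for p in pi_star])
print("objective at pi*   :", best)
print("objective at pi_ref:", baseline)
```

In this toy setting the objective evaluated at π* equals β · log Z(x), and it is higher than the value obtained by simply staying at π_ref.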

DPO's insight is to invert this relationship. Taking the log of the closed-form solution and solving for r expresses the reward in terms of the optimal policy π*:

r(x, y)  =  β · log[ π*(y|x) / π_ref(y|x) ]  +  β · log Z(x)

Plug this reparameterisation into the Bradley-Terry preference model (the standard assumption that human annotators prefer y_w over y_l with probability σ(r(x,y_w) - r(x,y_l))), and log Z(x) cancels out in the difference. The resulting training objective is:
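
The cancellation can be checked numerically. The sketch below reuses a hypothetical three-completion toy problem (all numbers are assumptions) and compares the Bradley-Terry probability computed from the true rewards with the one computed from the policy log-ratios alone, without ever using Z(x) in the second computation.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical toy setup: one prompt, three enumerable completions.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

# Closed-form optimal policy for the KL-regularised objective.
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

def implicit_reward(y):
    """beta * log(pi*(y) / pi_ref(y)); equals r(y) - beta * log Z."""
    return beta * math.log(pi_star[y] / pi_ref[y])

w, l = 1, 0  # indices of the preferred and dispreferred completions

p_true = sigmoid(rewards[w] - rewards[l])
p_implicit = sigmoid(implicit_reward(w) - implicit_reward(l))

print(p_true, p_implicit)  # the two preference probabilities agree
```

Each implicit reward differs from the true reward by the same constant β · log Z(x), so the constant drops out of the difference that Bradley-Terry depends on.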

L_DPO(π_θ) = -E_{(x, y_w, y_l) ~ D} [
    log σ( β · log[π_θ(y_w|x) / π_ref(y_w|x)]
           - β · log[π_θ(y_l|x) / π_ref(y_l|x)] )
]

This is a binary cross-entropy loss over preference pairs. No reward model. No rollouts. No PPO update loop. The policy network π_θ is simultaneously the reward model, encoded implicitly in its log-ratio against π_ref.
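
As a rough illustration, the loss can be written in a few lines of plain Python. This is a sketch, not a reference implementation: it assumes that the summed token log-probabilities of each completion under π_θ and π_ref have already been computed elsewhere, and the dictionary keys and example values are hypothetical.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def dpo_loss(batch, beta):
    """Mean DPO loss over a batch of preference pairs.

    Each item holds sequence log-probabilities of the chosen (w) and
    rejected (l) completions under the policy and the frozen reference.
    """
    total = 0.0
    for ex in batch:
        chosen_logratio = ex["policy_logp_w"] - ex["ref_logp_w"]
        rejected_logratio = ex["policy_logp_l"] - ex["ref_logp_l"]
        margin = beta * (chosen_logratio - rejected_logratio)
        total += -log_sigmoid(margin)
    return total / len(batch)

# Hypothetical log-probabilities for a single preference pair.
batch_init = [{"policy_logp_w": -12.0, "ref_logp_w": -12.0,
               "policy_logp_l": -15.0, "ref_logp_l": -15.0}]
batch_better = [{"policy_logp_w": -10.0, "ref_logp_w": -12.0,
                 "policy_logp_l": -17.0, "ref_logp_l": -15.0}]

loss_init = dpo_loss(batch_init, beta=0.1)
loss_better = dpo_loss(batch_better, beta=0.1)
print(loss_init, loss_better)
```

When the policy equals the reference, every margin is zero and the loss is log 2. Raising the chosen completion's log-ratio relative to the rejected one lowers the loss. Note that π_ref is still needed at training time, as a frozen model that supplies the reference log-probabilities.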

Why this is offline RL

Offline RL (sometimes called batch RL) refers to any approach that learns a policy entirely from a fixed dataset of transitions collected under some other behaviour policy, with no further environment interaction. The contrast is online RL, where the agent collects new data by acting in the environment during training.

DPO is offline RL in an almost literal sense:

Property	Online RLHF (PPO)	DPO
Rollouts during training	Yes, samples from `π_θ` at each step	No, uses the static preference dataset
Reward model required at train time	Yes	No (reward is implicit in `π_θ`)
Policy interacts with reward signal	Yes, live reward feedback	No, reward folded into loss
Data distribution	Shifts as policy improves	Fixed (collected under SFT or earlier policy)
Optimisation behaviour	Clipped surrogate objective, no global convergence guarantee	Binary cross-entropy loss with well-behaved gradients, though still no global guarantee for a neural policy
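
The "no rollouts" row is easiest to see in a toy training loop. The sketch below assumes a tabular softmax policy over three completions for a single prompt and a hand-written list of preference pairs; the function name, hyperparameters, and data are all hypothetical. The point is only that the loop iterates over a fixed dataset and never samples from the policy being trained.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train_dpo_offline(dataset, ref_logp, beta=0.5, lr=0.5, epochs=50):
    """Fit softmax logits to a fixed list of (winner, loser) index pairs."""
    theta = list(ref_logp)  # initialise the policy at the reference
    for _ in range(epochs):
        for w, l in dataset:  # static data only: no sampling from the policy
            # For a softmax policy the log-normaliser cancels in the difference.
            margin = beta * ((theta[w] - theta[l]) - (ref_logp[w] - ref_logp[l]))
            step = lr * beta * (1.0 - sigmoid(margin))  # gradient-descent step size
            theta[w] += step  # push the preferred completion up
            theta[l] -= step  # push the dispreferred completion down
    return theta

# Hypothetical reference policy and preference pairs (winner, loser).
ref = [0.5, 0.3, 0.2]
ref_logp = [math.log(p) for p in ref]
dataset = [(1, 0), (1, 2), (0, 2)]

theta = train_dpo_offline(dataset, ref_logp)
policy = softmax(theta)
print([round(p, 4) for p in policy])
```

An online method would replace the inner `for w, l in dataset` loop with fresh samples drawn from the current policy and scored by a reward model, which is the distribution shift listed in the table.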
