DPO in Practice

The problem with PPO-based RLHF is not theoretical: it is operational. You need to train a reward model, keep a frozen reference policy in GPU memory alongside the live policy, sample from the policy during training, run a KL-penalised RL update, and tune at least four hyperparameters that interact badly. For a 70B model this translates to weeks of engineering before you see a single useful gradient. DPO (Direct Preference Optimisation, Rafailov et al. 2023) collapses that entire pipeline into a single binary cross-entropy pass over preference pairs.

The Maths in One Screen

Standard RLHF maximises a KL-penalised reward objective:

max_π  E[r(x, y)] - β · KL[π(·|x) || π_ref(·|x)]

Solving this analytically yields the optimal policy in closed form:

π*(y|x)  ∝  π_ref(y|x) · exp( r(x, y) / β )

Rafailov et al. inverted this: instead of training r first and then optimising π*, they expressed r in terms of the log-ratio between policy and reference, then substituted back into the Bradley-Terry preference model. The reward disappears from the computation graph entirely, leaving a loss that depends only on the log-probabilities the model assigns to each response:

L_DPO(θ) = -E_{(x, y+, y-)} [
  log σ(
    β · (log π_θ(y+|x) / π_ref(y+|x))
      - β · (log π_θ(y-|x) / π_ref(y-|x))
  )
]

where σ is the sigmoid, y+ is the preferred response, y- is the rejected one, and β controls how tightly the policy is anchored to the reference. A large β (0.5 and above) keeps the policy close to the reference; a small β (0.01-0.05) allows larger deviation.

In practice, the loss presses the model to widen the implicit reward margin between y+ and y- relative to how the reference model scores them. Training metrics worth watching:

Metric	What to look for
`rewards/chosen`	Should rise steadily
`rewards/rejected`	Should fall or stay flat
`rewards/margins`	The key health signal; should widen
`rewards/accuracies`	Fraction where chosen > rejected; aim for >0.7

Setting Up a DPO Run with TRL

TRL's DPOTrainer is the standard entry point. Each dataset example needs three fields: a prompt, a chosen completion, and a rejected completion. Conversational format works natively; the trainer applies the chat template automatically.

from trl import DPOConfig, DPOTrainer
from datasets import load_dataset
from peft import LoraConfig

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    ref_model=None,           # uses initial model weights as reference
    args=DPOConfig(
        beta=0.1,             # KL penalty weight
        loss_type="sigmoid",  # standard DPO
        learning_rate=1e-6,   # lower than SFT; DPO is sensitive here
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        max_length=1024,
        bf16=True,
    ),
    train_dataset=dataset,
    peft_config=LoraConfig(r=64, lora_alpha=128, target_modules="all-linear"),
)
trainer.train()

When ref_model=None, TRL freezes a copy of the initial weights as the reference. When training with PEFT/LoRA, the frozen base is the reference and only the adapter is updated, which halves memory overhead compared to holding two separate full models.

The Maths in One Screen

Setting Up a DPO Run with TRL

Keep reading with Pro.