ORPO and Reference-Free Alignment

Standard alignment pipelines require two separate training phases and two copies of a model in memory: one being trained, one frozen as a reference. ORPO (Odds Ratio Preference Optimization), introduced by Hong, Lee, and Thorne in March 2024, folds both phases into one pass and eliminates the reference model entirely. On the UltraFeedback dataset, a 7B Mistral fine-tuned with ORPO reached 12.20% on AlpacaEval 2.0 and 7.32 on MT-Bench, outperforming several models in the 13B parameter class.

Why the Reference Model Exists - and Why It Is Costly

In RLHF and DPO-family methods, the reference model serves as a KL-divergence anchor. The training signal is not just "score the chosen response higher"; it is "score the chosen response higher while not drifting too far from what the reference model (typically the SFT checkpoint) would have said." Without this anchor, a model can over-optimise the preference signal, collapsing to repetitive safe outputs or drifting in harmful directions.

The standard DPO loss reflects this:

\[\mathcal{L}_\text{DPO}(\theta) = -\mathbb{E} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} \right) \right]\]

where y_w is the preferred (winning) response and y_l is the dispreferred (losing) response. The reference log-probabilities π_ref come from a frozen copy of the model, which must either be run on every training batch or have its outputs precomputed and cached. For a 7B model in bf16, the reference weights alone take roughly 14 GB of VRAM (7 billion parameters × 2 bytes), on top of the trainable copy, its gradients and optimiser state, and activations.
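To make the role of the reference terms concrete, here is a minimal sketch of the DPO loss for a single preference pair, together with the memory arithmetic quoted above. It assumes the four inputs are summed sequence log-probabilities that have already been computed elsewhere; the function names and the value beta=0.1 are illustrative choices, not taken from any particular library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_pair_loss(policy_logp_w: float, policy_logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities log pi(y | x).
    beta=0.1 is an illustrative value (an assumption of this sketch).
    """
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi_theta/pi_ref for y_w
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi_theta/pi_ref for y_l
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (bf16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9


# When the policy equals the reference, both log-ratios are zero
# and the loss is log(2), whatever the raw log-probabilities are.
print(dpo_pair_loss(-10.0, -12.0, -10.0, -12.0))

# Frozen 7B reference model in bf16.
print(weights_gb(7e9))  # 14.0
```

Note that the loss cannot be evaluated at all without ref_logp_w and ref_logp_l, which is why the frozen copy (or a cache of its outputs) has to exist for the whole training run.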

A further consequence: because DPO presupposes a well-initialised policy, practitioners must run a full SFT phase first to get the model onto the correct distribution. That is two training jobs, two datasets (SFT corpus + preference pairs), and two sets of hyperparameters to tune.

The ORPO Objective

ORPO's insight is that SFT supervision is already doing most of the work. The negative log-likelihood (NLL) loss over chosen responses forces the model to imitate good outputs. What SFT lacks is a contrast signal: it never says "and do not produce this." ORPO adds that signal directly, via a log-odds ratio term appended to the SFT loss:

\[\mathcal{L}_\text{ORPO} = \mathcal{L}_\text{NLL} + \lambda \cdot \mathcal{L}_\text{OR}\]

The odds ratio component is:

\[\mathcal{L}_\text{OR} = -\mathbb{E} \left[ \log \sigma \left( \log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)} \right) \right]\]

where the odds of a sequence are defined as:

\[\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}\]
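Substituting this definition into the odds ratio loss and using the fact that the logarithm of a ratio is a difference of logarithms, the argument of the sigmoid expands as follows. This is a direct algebraic consequence of the two definitions above, written out because it is the form most convenient to compute from log-probabilities:

\[\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)} = \left[ \log P_\theta(y_w \mid x) - \log\left(1 - P_\theta(y_w \mid x)\right) \right] - \left[ \log P_\theta(y_l \mid x) - \log\left(1 - P_\theta(y_l \mid x)\right) \right]\]

Because the odds are strictly increasing in P_θ, this quantity is positive exactly when the policy assigns the chosen response a higher probability than the rejected one.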

No frozen model. No KL term. The ratio is computed purely from the current policy. The NLL loss on chosen responses handles style adaptation; the odds ratio term penalises the model for assigning high probability to rejected responses relative to chosen ones.

The scalar λ (called beta in the TRL implementation, defaulting to 0.1) controls how strongly the rejection penalty is weighted. A value of 0.1 means the preference signal is kept deliberately mild, consistent with the paper's argument that "a minor penalty for the disfavored generation style is sufficient."
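The full objective can be sketched for a single preference pair in a few lines of plain Python. This is an illustration under stated assumptions rather than the reference implementation: it assumes P_θ(y | x) is length-normalised (the exponential of the average per-token log-probability), that the NLL term uses the same average, and that per-token log-probabilities are already available. The names orpo_pair_loss and lam are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def avg_logp(token_logps: list[float]) -> float:
    """Length-normalised sequence log-probability (assumption of this sketch)."""
    return sum(token_logps) / len(token_logps)


def log_odds(logp: float) -> float:
    """log(p / (1 - p)) computed from log p; requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))


def orpo_pair_loss(chosen_token_logps: list[float],
                   rejected_token_logps: list[float],
                   lam: float = 0.1) -> tuple[float, float, float]:
    """Return (total, nll, odds_ratio_loss) for one preference pair."""
    logp_w = avg_logp(chosen_token_logps)
    logp_l = avg_logp(rejected_token_logps)

    nll = -logp_w                                     # SFT term on the chosen response
    log_odds_ratio = log_odds(logp_w) - log_odds(logp_l)
    or_loss = -log_sigmoid(log_odds_ratio)            # contrast term, no reference model
    return nll + lam * or_loss, nll, or_loss


# Hypothetical per-token log-probabilities from the current policy.
chosen = [-0.2, -0.1, -0.3]
rejected = [-1.0, -1.5, -0.5]

total, nll, or_loss = orpo_pair_loss(chosen, rejected)
print(f"nll={nll:.4f}  or_loss={or_loss:.4f}  total={total:.4f}")
```

Every quantity here comes from the policy being trained. With lam at 0.1, the odds ratio term contributes only a tenth of its value to the total, so the NLL term dominates unless the rejected response is much more likely than the chosen one.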
