ORPO and Reference-Free Alignment
ORPO collapses supervised fine-tuning and preference alignment into a single training phase by appending a log-odds-ratio penalty directly to the NLL loss, removing the need for a reference model.
Standard alignment pipelines run two separate training phases, supervised fine-tuning followed by preference optimisation, and the second phase keeps two copies of the model in memory: one being trained, one frozen as a reference. ORPO (Odds Ratio Preference Optimisation), introduced by Hong, Lee, and Thorne in March 2024, folds both phases into one pass and eliminates the reference model entirely. Trained on the UltraFeedback dataset alone, a 7B Mistral fine-tuned with ORPO reached 12.20% on AlpacaEval 2.0 and 7.32 on MT-Bench, outperforming several models in the 13B parameter class.
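To make the combined objective concrete, here is a minimal sketch in plain Python. It follows the form given in the ORPO paper: the NLL of the chosen response, plus a weight times -log sigmoid of the log odds ratio between the chosen and rejected responses, where each response's probability is length-normalised. The function names, the list-of-token-log-probabilities input format, and the weight lam=0.1 are illustrative assumptions, not part of any library.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_odds(avg_logp):
    # log(p / (1 - p)) with p = exp(avg_logp); assumes avg_logp < 0.
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(chosen_token_logps, rejected_token_logps, lam=0.1):
    # Length-normalised log-likelihood of each response under the
    # single model being trained. No reference model is involved.
    avg_w = sum(chosen_token_logps) / len(chosen_token_logps)
    avg_l = sum(rejected_token_logps) / len(rejected_token_logps)

    nll = -avg_w  # ordinary SFT term on the chosen response
    log_odds_ratio = log_odds(avg_w) - log_odds(avg_l)
    or_penalty = -log_sigmoid(log_odds_ratio)  # small when chosen is favoured

    # lam is a hypothetical weighting; tune it for a real run.
    return nll + lam * or_penalty

# Hypothetical per-token log-probabilities for one preference pair.
print(orpo_loss([-0.4, -0.6], [-1.8, -2.2]))
```

Every quantity in the loss comes from one forward pass of the model being trained on the chosen and rejected responses, which is why no second model has to be held in memory.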
Why the Reference Model Exists - and Why It Is Costly
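In RLHF and DPO-family methods, the reference model serves as a KL-divergence anchor. The training signal is not just "score the chosen response higher"; it is "score the chosen response higher while not drifting too far from what the reference model (typically the supervised fine-tuned checkpoint) would have said." Without this anchor, a model can over-optimise the preference signal, collapsing into repetitive safe outputs or diverging in harmful directions. The cost is that the frozen reference must be kept in memory and run forward on every batch.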
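The role of the anchor is easiest to see in the standard DPO loss, sketched below in plain Python. The function and argument names are illustrative assumptions, and beta=0.1 is only an example value. The inputs are summed sequence log-probabilities of the chosen (w) and rejected (l) responses under the policy and under the frozen reference.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Each term measures how far the policy has moved from the reference
    # on that response; the loss rewards moving further on the chosen one.
    chosen_shift = policy_logp_w - ref_logp_w
    rejected_shift = policy_logp_l - ref_logp_l
    margin = beta * (chosen_shift - rejected_shift)
    return softplus(-margin)  # equals -log(sigmoid(margin))

# Hypothetical values: the policy has raised the chosen response and
# lowered the rejected one relative to the reference.
print(dpo_loss(-10.0, -14.0, ref_logp_w=-12.0, ref_logp_l=-12.0))
```

The two ref_logp arguments are the cost in question: computing them requires a second, frozen copy of the model and an extra forward pass per response.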