KTO: Unpaired Preference Learning

Collecting preference data is a bottleneck in practice. DPO and RLHF both require pairs - a chosen response and a rejected response for the same prompt. That pairing requirement is harder to satisfy than it sounds: annotators must evaluate two responses simultaneously, and even small timing or framing differences between the two can introduce noise. In many real deployments, you already have logs of model outputs with thumbs-up/thumbs-down ratings, but those ratings were never collected as paired comparisons. They are inherently unpaired.

Kahneman-Tversky Optimisation (KTO), introduced by Ethayarajh et al. at ICML 2024, is built precisely for this regime. It trains directly on binary desirability signals - each example is simply (prompt, response, label) where label is either desirable or undesirable - and matches or exceeds DPO's performance on models from 1B to 30B parameters.
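
To make the data format concrete, here is a minimal sketch of turning a feedback log into unpaired examples. The field names (`prompt`, `response`, `rating`) and the `"up"`/`"down"` values are hypothetical and stand in for whatever your logging schema uses. The second helper illustrates that an existing preference pair can also be split into two unpaired examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KTOExample:
    """One unpaired training example: (prompt, response, label)."""
    prompt: str
    response: str
    desirable: bool


def from_feedback_log(rows):
    """Convert logged thumbs ratings into unpaired examples.

    Assumes each row is a dict with hypothetical keys 'prompt',
    'response' and 'rating' ('up' or 'down'). Unrated rows are skipped.
    """
    examples = []
    for row in rows:
        rating = row.get("rating")
        if rating not in ("up", "down"):
            continue  # no usable desirability signal
        examples.append(
            KTOExample(row["prompt"], row["response"], rating == "up")
        )
    return examples


def from_preference_pair(prompt, chosen, rejected):
    """Split one paired comparison into two unpaired examples."""
    return [
        KTOExample(prompt, chosen, True),
        KTOExample(prompt, rejected, False),
    ]
```

Nothing here requires that a prompt appears with both labels, which is the point: each row is usable on its own.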

The prospect theory framing

The name is not decorative. Kahneman and Tversky's prospect theory models how humans actually perceive utility under uncertainty, rather than how a rational agent would. Two key properties matter here:

Reference-point sensitivity: people evaluate outcomes relative to a reference point, not in absolute terms.
Loss aversion: losses loom larger than equivalent gains.
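
Both properties show up in the classic Kahneman-Tversky value function. The sketch below uses the median parameter estimates reported by Tversky and Kahneman (1992), an exponent of 0.88 and a loss-aversion coefficient of 2.25. The function name and signature are illustrative only. The KTO loss later in this section uses a sigmoid rather than this power law.

```python
def kt_value(outcome, reference=0.0, alpha=0.88, loss_aversion=2.25):
    """Subjective value of an outcome relative to a reference point.

    Gains are compressed by the exponent alpha; losses are compressed
    the same way and then scaled up by the loss-aversion coefficient.
    """
    z = outcome - reference  # only the gap to the reference point matters
    if z >= 0:
        return z ** alpha
    return -loss_aversion * ((-z) ** alpha)


gain = kt_value(10.0)    # value of gaining 10
loss = kt_value(-10.0)   # value of losing 10; larger in magnitude
```

Shifting the outcome and the reference point by the same amount leaves the value unchanged, which is what reference-point sensitivity means in practice.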

The KTO paper argues that existing alignment objectives implicitly encode some of these biases, and it makes them explicit by defining a family of human-aware losses (HALOs). The framework covers several existing objectives: the paper shows that DPO and PPO-Clip both belong to the HALO family, each with a different implicit utility function.

What makes KTO distinct is that it picks a utility function modelled on the one in the prospect theory literature, rather than one that merely happens to train well.

The loss function

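Each training example has a prompt x, a response y, and a binary label. KTO assigns each example a sigmoid-shaped value and minimises the gap between that value and its maximum. For a desirable example the loss is:

L_KTO = λ_D * ( 1 - σ( β * (r_θ(x,y) - z₀) ) )

For an undesirable example the loss is:

L_KTO = λ_U * ( 1 - σ( β * (z₀ - r_θ(x,y)) ) )

Minimising the first pushes r_θ(x,y) above z₀; minimising the second pushes it below.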

Where:

| Symbol | Meaning |
| --- | --- |
| `r_θ(x,y)` | log π_θ(y\|x) - log π_ref(y\|x), the implicit reward |
| `z₀` | KL reference point: E[log π_θ(y'\|x) - log π_ref(y'\|x)] over sampled y' |
| `β` | KL penalty coefficient; controls how far policy can stray from reference |
| `λ_D`, `λ_U` | loss weights for desirable and undesirable examples respectively |
| `σ` | sigmoid function |

The z₀ term is what makes this a reference-point model in the prospect theory sense. Rather than measuring reward in absolute terms, KTO measures reward relative to the expected reward under the current policy on the same prompt. A response that looks good in isolation but is merely average for that prompt does not receive a strong positive signal.
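
A minimal per-example sketch of the objective in plain Python follows. It is written so that minimising the loss raises the implicit reward of desirable examples and lowers that of undesirable ones, using the form weight minus value. The function names, the default hyperparameters, and the clamping of the z₀ estimate at zero are assumptions for illustration. A real implementation would operate on batched tensors and would not backpropagate through z₀.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def implicit_reward(logp_policy, logp_ref):
    """r_theta(x, y): log-probability ratio of policy to reference."""
    return logp_policy - logp_ref


def estimate_z0(sample_rewards):
    """Reference point from implicit rewards of sampled responses y'.

    Assumption: a simple mean, clamped at zero because a KL
    divergence cannot be negative.
    """
    return max(0.0, sum(sample_rewards) / len(sample_rewards))


def kto_loss(reward, z0, desirable, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Per-example loss: weight minus the sigmoid-shaped value."""
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (reward - z0)))
    return lambda_u * (1.0 - sigmoid(beta * (z0 - reward)))


# A desirable response scoring exactly at the reference point sits at
# the midpoint of the sigmoid, so it gets half the maximum loss.
r = implicit_reward(logp_policy=-12.0, logp_ref=-12.0)
midpoint_loss = kto_loss(r, z0=0.0, desirable=True)
```

Setting `lambda_u` higher than `lambda_d` would weight undesirable examples more heavily, which is one way to express loss aversion or to compensate for an imbalanced dataset.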
