Multi-Objective Alignment

A model trained to maximise helpfulness alone will happily explain how to synthesise dangerous chemicals if the question is phrased politely. A model trained to maximise harmlessness alone will refuse most interesting requests. The tension is not a training bug; it is a fundamental property of the objective landscape. Multi-objective alignment is the set of methods that take this tension seriously rather than sweeping it into a single weighted sum and hoping for the best.

Why one reward signal is not enough

Standard RLHF collapses all desiderata into one reward model trained on human preference comparisons. The implicit assumption is that a single number can rank every possible response on a dimension that conflates helpfulness, factual accuracy, safety, tone, length, and legal compliance. In practice this works tolerably for average queries but breaks at the extremes: a verbosely correct but unsafe answer can outscore a terse safe one, depending on which annotators happened to label which pairs.

The deeper issue is that human preferences are not consistent across contexts. A medical professional asking about drug interactions has legitimately different needs from an anonymous user asking the same question. A single scalar reward cannot represent this context-dependence; it averages over population heterogeneity and loses the variation entirely.

Formally, suppose we have \(k\) objectives \(r_1, \ldots, r_k\) (e.g., helpfulness, harmlessness, honesty). The standard approach produces a single reward \(r = \sum_i w_i r_i\) with fixed weights \(w_i\). The problem: any fixed \(w\) encodes a specific value judgement about trade-offs before training, and the resulting policy cannot be cheaply adapted to a different trade-off at inference time without retraining.

Multi-objective alignment instead attempts to learn a representation of the entire Pareto front, so that a specific operating point can be selected at deployment.

MODPO: folding multiple objectives into DPO

Multi-Objective Direct Preference Optimization (MODPO) extends DPO to handle \(k\) reward signals without reinforcement learning. The key insight is that each additional objective beyond the primary one can be treated as a margin constraint: a response must not only be preferred on the primary criterion, it must also clear a threshold on every secondary criterion.

MODPO trains the language model as an implicit collective reward model. Given a set of per-objective preference datasets \(\mathcal{D}_1, \ldots, \mathcal{D}_k\), the method combines them into a joint objective whose solution is theoretically equivalent to multi-objective RLHF but requires roughly three times less compute, because no separate RL training loop is needed. Empirical results on safety alignment and long-form QA show that MODPO traces out a Pareto front of policies indexed by the weight vector \(w\), allowing post-hoc selection of the operating point without retraining (Zhou et al., ACL Findings 2024, arXiv:2310.03708).

A simplified view of the MODPO loss for two objectives:

Why one reward signal is not enough

MODPO: folding multiple objectives into DPO

Keep reading with Pro.