Multi-Objective Alignment

Multi-objective alignment trains a single language model to satisfy several competing criteria simultaneously by navigating the Pareto front of reward trade-offs rather than collapsing them into one scalar.

A model trained to maximise helpfulness alone will happily explain how to synthesise dangerous chemicals if the question is phrased politely. A model trained to maximise harmlessness alone will refuse most interesting requests. The tension is not a training bug; it is a fundamental property of the objective landscape. Multi-objective alignment is the set of methods that take this tension seriously rather than folding it into a single weighted sum and hoping for the best.

Why one reward signal is not enough

Standard RLHF collapses all desiderata into one reward model trained on human preference comparisons. The implicit assumption is that a single number can rank every possible response along one axis that conflates helpfulness, factual accuracy, safety, tone, length, and legal compliance. In practice this works tolerably for typical queries but breaks at the extremes: a verbose, correct, but unsafe answer can outscore a terse, safe one, depending on which annotators happened to label which pairs.
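
To make the failure mode concrete, here is a minimal Python sketch contrasting a weighted-sum scalar reward with a Pareto-front view. Everything in it is an assumption for illustration: the candidate names, the two objectives (helpfulness and safety), the scores, and the weights are invented, and the helper functions are hypothetical rather than part of any RLHF library.

```python
from typing import Dict, List, Tuple

# Hypothetical (helpfulness, safety) scores in [0, 1]; invented for illustration.
CANDIDATES: Dict[str, Tuple[float, float]] = {
    "verbose_unsafe": (0.9, 0.2),
    "terse_safe": (0.5, 0.9),
    "vague_risky": (0.4, 0.1),
}


def scalarize(scores: Tuple[float, ...], weights: Tuple[float, ...]) -> float:
    """Collapse a score vector into one number via a weighted sum."""
    return sum(w * s for w, s in zip(weights, scores))


def best_by_weighted_sum(
    candidates: Dict[str, Tuple[float, ...]], weights: Tuple[float, ...]
) -> str:
    """Return the candidate a single scalar reward would rank first."""
    return max(candidates, key=lambda name: scalarize(candidates[name], weights))


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """a dominates b if it is at least as good everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    ]


# The scalar winner flips with the (assumed) weights.
helpful_heavy = best_by_weighted_sum(CANDIDATES, (0.7, 0.3))
safety_heavy = best_by_weighted_sum(CANDIDATES, (0.3, 0.7))

# The Pareto front keeps both trade-off points and drops the dominated one.
front = pareto_front(CANDIDATES)
```

With these made-up numbers, the helpfulness-heavy weighting picks the unsafe answer and the safety-heavy weighting picks the safe one, while the Pareto front retains both and discards only the response that is worse on every axis. The point of the sketch is that the scalar ranking depends on a weighting that standard RLHF never states explicitly; it is absorbed from whichever preference pairs the annotators happened to label.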
