Length Bias and Verbosity Control

A single preference label from a human annotator cannot distinguish "this response is better because it is more thorough" from "this response feels better because it is longer." That ambiguity compounds across tens of thousands of preference labels, and the reward model absorbs it as a near-universal rule: longer text is preferred text. An experiment published at COLM 2024 (Singhal et al.) found that a reward function built from nothing but output length reproduced most of the quality gains typically attributed to RLHF. That is an uncomfortable finding for a field that runs reward models as proxies for human judgement.

Why human annotators over-reward length

Preference annotation is cognitively expensive. Judging whether two responses are equally accurate requires domain knowledge and careful reading. Judging which response looks more comprehensive is much faster and feels reasonable. Humans systematically prefer "more complete-seeming" answers, and the reward model learns that signal faithfully.

Three reinforcing effects compound the bias:

Effect	Mechanism	Result
Coverage heuristic	Annotators interpret length as a proxy for completeness	Reward model assigns higher scores to longer completions
Presentation over substance	A polished but inaccurate long answer beats a terse correct one	Factual quality is under-weighted
Optimiser exploit	PPO gradient ascent finds length as a low-resistance path to higher reward	Policy learns verbosity rather than quality

Once the reward model has absorbed these biases from preference data, the policy optimiser does exactly what it should: it maximises expected reward. The result is that the trained policy is incentivised to pad, hedge, restate, and over-explain.

How length bias propagates through the pipeline

The effect is not uniform; it cascades across training stages.

Preference data collection. A dataset where annotator A consistently prefers longer responses for domain X will create a reward model that over-weights length in that domain, even if the original preference was based on something else entirely.

Reward model training. Reward models are typically trained as binary classifiers over (chosen, rejected) pairs, usually with a Bradley-Terry pairwise loss. Length correlates with the chosen label strongly enough that the model encodes it as a primary signal, not as one factor among many. The problem is worse because most reward models are trained on a mix of domains: domain-specific quality signals are weaker, but length is always present.
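To make this concrete, here is a minimal sketch of a pairwise reward-model loss, assuming the Bradley-Terry formulation that is common for reward models. The pair lengths and the two toy reward functions are hypothetical. It only illustrates that a reward depending on nothing but length already beats an uninformative reward whenever longer responses are chosen more often.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood that the chosen response wins."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def mean_loss(pairs, reward_fn):
    """Average pairwise loss of reward_fn over (chosen_len, rejected_len) pairs."""
    total = sum(pairwise_loss(reward_fn(c), reward_fn(r)) for c, r in pairs)
    return total / len(pairs)

# Hypothetical token lengths; the chosen response is longer in 3 of 4 pairs.
PAIRS = [(300, 100), (250, 120), (90, 200), (400, 150)]

constant_loss = mean_loss(PAIRS, lambda n: 0.0)          # ignores its input
length_only_loss = mean_loss(PAIRS, lambda n: 0.01 * n)  # scores length alone

print(f"constant reward:    {constant_loss:.3f}")
print(f"length-only reward: {length_only_loss:.3f}")
```

The length-only reward reaches a lower loss than the constant one on this toy data, which is the shortcut a real reward model is free to take.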

PPO optimisation. Policy gradient methods will exploit whatever gradient the reward surface offers. Length is differentiable (in expectation), easy to increase, and reliably rewarded. The model discovers that adding a summary paragraph, repeating key points, or inserting a caveat-heavy disclaimer raises the reward signal almost for free.

Automatic evaluation. GPT-4-based evaluators inherit the same bias when used as judges. AlpacaEval's original win-rate metric correlates strongly with output length; Dubois et al. (2024) showed that length-controlled AlpacaEval raised Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98 by regressing out the effect of length difference.

The bias does not stay in one corner of the pipeline. It flows from labeller psychology into reward model weights into policy behaviour into evaluation, completing a self-reinforcing loop.

Mitigations used in practice

No single fix is universally reliable. Practitioners combine several approaches, each addressing a different point in the pipeline.

Reward model debiasing. Singhal et al. show that retraining the reward model with length-balanced data is the most effective single intervention. Concretely: if the preference dataset shows that longer responses were chosen at a rate 15% above what length-stratified sampling would predict, resample or re-weight to remove that excess.
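The re-weighting idea can be sketched as follows. This is a simplified illustration, not the procedure from Singhal et al.: it assumes a hypothetical `Pair` record holding token lengths, and it targets equal total weight for "longer chosen" and "shorter chosen" pairs, which is a cruder target than a length-stratified baseline.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    chosen_len: int    # token length of the preferred response
    rejected_len: int  # token length of the rejected response

def longer_chosen_rate(pairs):
    """Fraction of pairs (ignoring length ties) where the longer response won."""
    decided = [p for p in pairs if p.chosen_len != p.rejected_len]
    if not decided:
        return 0.5
    return sum(p.chosen_len > p.rejected_len for p in decided) / len(decided)

def balance_weights(pairs):
    """Per-pair weights giving both length outcomes the same total weight."""
    n_long = sum(p.chosen_len > p.rejected_len for p in pairs)
    n_short = sum(p.chosen_len < p.rejected_len for p in pairs)
    half = 0.5 * (n_long + n_short)
    weights = []
    for p in pairs:
        if p.chosen_len > p.rejected_len:
            weights.append(half / n_long)
        elif p.chosen_len < p.rejected_len:
            weights.append(half / n_short)
        else:
            weights.append(1.0)  # length ties carry no length signal
    return weights

pairs = [Pair(300, 100), Pair(250, 120), Pair(400, 150), Pair(90, 200)]
weights = balance_weights(pairs)
print(longer_chosen_rate(pairs), weights)
```

The weights would then multiply each pair's loss during reward-model training. If one of the two groups is empty, the data cannot be balanced this way and more preference data is needed.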

Length-normalised reward. Divide or subtract a length term from the scalar reward before passing it to PPO. A simple formulation is:

r_adjusted = r_raw - λ · (len(response) / len(reference))

where λ is a tuned coefficient and reference is a baseline length (e.g., the SFT model's median output length). This is easy to implement but sensitive to λ: too large, and the model learns to be cryptically terse; too small, and the bias persists.
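A minimal sketch of the subtractive form, with lengths in tokens. The function name and the example numbers are illustrative assumptions; the two λ values are chosen only to show how the ranking of a long and a short response can flip.

```python
def length_adjusted_reward(r_raw: float, response_len: int,
                           reference_len: float, lam: float = 0.1) -> float:
    """r_adjusted = r_raw - lam * (len(response) / len(reference))."""
    if reference_len <= 0:
        raise ValueError("reference_len must be positive")
    return r_raw - lam * (response_len / reference_len)

# Hypothetical case: a long response with a slightly higher raw reward
# versus a short one, with the reference length set to 100 tokens.
for lam in (0.01, 0.1):
    long_r = length_adjusted_reward(1.0, response_len=400, reference_len=100, lam=lam)
    short_r = length_adjusted_reward(0.9, response_len=100, reference_len=100, lam=lam)
    print(f"lam={lam}: long={long_r:.2f} short={short_r:.2f}")
```

With the smaller λ the long response still scores higher; with the larger λ the short one does, which is the sensitivity described above.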

Separate reward head for length. Train the reward model to predict two scalars: a quality score and a length score. Optimise the policy against the quality score only, while monitoring the length score as a diagnostic. This separates the objective from the nuisance variable.

DPO with length-balanced pairs. DPO directly optimises the policy against contrastive pairs without a separate reward model. If the training pairs are length-balanced (chosen and rejected responses have similar lengths), the policy cannot exploit length as a shortcut. This requires careful curation of the dataset.
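One way to curate such pairs is a simple ratio filter, sketched below. The whitespace tokenizer and the `max_ratio` threshold of 1.5 are assumptions for illustration; a real pipeline would use the policy's own tokenizer and a threshold tuned to the dataset.

```python
def token_count(text: str) -> int:
    # Whitespace split is a stand-in for a real tokenizer (assumption).
    return len(text.split())

def is_length_balanced(chosen: str, rejected: str, max_ratio: float = 1.5) -> bool:
    """True if the longer response is at most max_ratio times the shorter one."""
    shorter, longer = sorted((token_count(chosen), token_count(rejected)))
    if shorter == 0:
        return False  # an empty response cannot be meaningfully compared
    return longer / shorter <= max_ratio

def filter_balanced(pairs, max_ratio: float = 1.5):
    """Keep only (chosen, rejected) pairs whose lengths are comparable."""
    return [(c, r) for c, r in pairs if is_length_balanced(c, r, max_ratio)]

pairs = [
    ("the cache is stale so refresh it",
     "the cache looks stale so you should refresh it"),
    ("yes", "yes , and here is a long list of extra reasons"),
]
kept = filter_balanced(pairs)
print(len(kept), "of", len(pairs), "pairs kept")
```

Filtering discards data, so the share of pairs removed is worth tracking per domain before training on what remains.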

Constitutional or rule-based length penalties. Some production systems add a rule that fires if a response exceeds N tokens, either penalising the reward directly or forcing an end-of-sequence token. The approach is crude but interpretable, and it is useful as a guardrail when other mitigations are insufficient.
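A sketch of the reward-penalty variant of such a rule, assuming a hinge-style penalty that is zero up to the cap and grows linearly beyond it. The cap of 512 tokens and the per-token penalty are placeholder values, not recommendations.

```python
def capped_reward(r_raw: float, n_tokens: int,
                  max_tokens: int = 512, penalty_per_token: float = 0.01) -> float:
    """Leave the reward untouched up to max_tokens, then penalise the excess."""
    excess = max(0, n_tokens - max_tokens)
    return r_raw - penalty_per_token * excess

print(capped_reward(1.0, 500))  # under the cap: the rule does not fire
print(capped_reward(1.0, 612))  # 100 tokens over the cap: penalised
```

Unlike a length-normalised reward, this rule leaves every response under the cap unaffected.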

Evaluator corrections. For benchmarking, the length-controlled AlpacaEval approach fits a generalised linear model on preference data, predicts what the preference would be if both candidates had identical lengths, and uses that counterfactual win rate as the benchmark score.
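The counterfactual step can be illustrated with a deliberately reduced model: a logistic regression with one intercept and one length feature, fitted by plain gradient descent. This is a toy sketch, not the published AlpacaEval implementation, which includes further terms that are omitted here. The data, the tanh feature scaling, and the optimiser settings are all assumptions.

```python
import math
from statistics import pstdev

def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def length_controlled_win_rate(wins, len_diffs, steps: int = 2000, lr: float = 0.5) -> float:
    """Fit P(win) = sigmoid(theta + phi * tanh(len_diff / scale)) and
    return the predicted win rate at len_diff = 0, i.e. sigmoid(theta)."""
    scale = pstdev(len_diffs) or 1.0
    feats = [math.tanh(d / scale) for d in len_diffs]
    theta, phi = 0.0, 0.0
    n = len(wins)
    for _ in range(steps):
        g_theta = g_phi = 0.0
        for y, x in zip(wins, feats):
            err = sigmoid(theta + phi * x) - y  # gradient of the log-loss
            g_theta += err
            g_phi += err * x
        theta -= lr * g_theta / n
        phi -= lr * g_phi / n
    return sigmoid(theta)

# Hypothetical judge outcomes for a model that is always longer than the
# baseline and wins more often when the length gap is larger.
len_diffs = [50, 50, 100, 100, 300, 300, 400, 400]
wins = [0, 1, 0, 0, 1, 1, 1, 0]

raw = sum(wins) / len(wins)
lc = length_controlled_win_rate(wins, len_diffs)
print(f"raw win rate: {raw:.2f}, length-controlled: {lc:.2f}")
```

On this toy data the length-controlled estimate falls below the raw win rate, because part of the raw advantage is attributed to the length gap. With so few points the estimate is an extrapolation and should not be read as a realistic score.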

When it falls down

Length penalties under-reward genuinely complex answers. Some questions legitimately require long responses. A model penalised too aggressively for length will give terse answers to complex multi-step problems and score poorly on tasks where depth matters. The optimal response length is query-dependent.

Balancing preference data is hard to generalise. Resampling by length works within a domain, but different domains have different natural response lengths. A balanced technical documentation response may be 600 tokens; a balanced conversational response may be 80 tokens. A single global resampling scheme can distort the distribution of topics.

Verbosity can shift form, not disappear. A model trained with a global length penalty sometimes learns to compress without improving: shorter sentences, more jargon, less explanation. The token count drops, but the response becomes less useful to the reader. Length is a proxy for verbosity; verbosity is a proxy for the underlying quality failure.

Circular evaluation. If the benchmark itself is biased (the original AlpacaEval win rate, or GPT-4-as-judge without length controls), then the mitigation that maximises that benchmark metric may re-introduce length bias in a subtler form. The Dubois et al. correction helps, but no evaluator is fully deconfounded.

Adversarial stability. A reward model retrained on length-balanced data will have lower length bias, but it can still be gamed. If the policy optimiser keeps training past the point where the reward model's scores remain accurate (that is, into reward hacking), it will exploit new spurious correlates. Length was just the first and most obvious.
