
Applied LLMs

Building a Preference Dataset

A preference dataset pairs model outputs and records which one a human (or AI judge) preferred, providing the training signal that separates a helpful assistant from a raw base model.


The entire post-training alignment pipeline rests on one fragile foundation: a collection of (prompt, chosen response, rejected response) triples. Get those triples wrong and every subsequent step (reward modelling, RLHF, DPO) goes wrong with them. InstructGPT (Ouyang et al., 2022) trained its reward model on labeller rankings of model outputs for roughly 33,000 prompts, collected from a team of contractors; a 1.3B-parameter model fine-tuned on that signal was preferred over the raw 175B GPT-3 in human evaluations. That ratio, a small labelled dataset beating a raw model more than 100 times larger, is the economic argument for preference data.

What a preference record actually contains

A single preference example has three parts:

{
  "prompt":   "Explain gradient descent to a 10-year-old.",
  "chosen":   "Imagine you're hiking and want to find the lowest valley...",
  "rejected": "Gradient descent is an iterative optimisation algorithm..."
}

The chosen and rejected fields are ranked, not scored. The only claim the label makes is "given this prompt, a human preferred A over B." There is no absolute quality score. That relativity is both a strength (a comparison is much easier to elicit than a calibrated Likert rating) and a source of systematic error (see "When it falls down" below).

Modern datasets often add metadata: the annotator's confidence, the criteria used (helpfulness vs. harmlessness vs. honesty), and which model generated each response. The Anthropic HH-RLHF dataset, for instance, contains roughly 161,000 training pairs covering both helpfulness and harmlessness dimensions, with the two axes collected and labelled separately.
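A record with this kind of metadata could be represented as follows. This is a minimal sketch, not a standard schema: the field names `criterion`, `annotator_confidence`, `chosen_model`, and `rejected_model` are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PreferenceRecord:
    """One comparison: the three core fields plus optional metadata."""
    prompt: str
    chosen: str
    rejected: str
    # Metadata field names below are illustrative, not a standard schema.
    criterion: Optional[str] = None               # e.g. "helpfulness" or "harmlessness"
    annotator_confidence: Optional[float] = None  # self-reported, 0.0 to 1.0
    chosen_model: Optional[str] = None            # model that produced `chosen`
    rejected_model: Optional[str] = None          # model that produced `rejected`

    def to_jsonl(self) -> str:
        """Serialise to one JSON line, dropping metadata that was not recorded."""
        return json.dumps({k: v for k, v in asdict(self).items() if v is not None})


record = PreferenceRecord(
    prompt="Explain gradient descent to a 10-year-old.",
    chosen="Imagine you're hiking and want to find the lowest valley...",
    rejected="Gradient descent is an iterative optimisation algorithm...",
    criterion="helpfulness",
)
line = record.to_jsonl()
```

Keeping the metadata optional means records from different collection rounds can share one file even when later rounds capture more fields.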

Four collection strategies

1. Human pairwise comparison

Annotators see a prompt and two model outputs side by side, then select the better one. This is the gold-standard source used in InstructGPT and Llama 2. Llama 2's preference data included multiple annotators per pair, with the majority vote used for training and disagreement rates tracked as a data-quality signal. The cost per annotation is typically $0.10-$1.00 depending on task complexity and annotator expertise.

2. Scaled rating then conversion

Some pipelines collect 1-5 Likert ratings per response, then derive pairs by treating any two responses whose ratings differ by at least 1 as (chosen, rejected). This yields more signal per prompt but introduces noise at the conversion step: unless the size of the gap is carried along, training treats a 4-versus-5 pair exactly like a 1-versus-5 pair, even though the second expresses a far stronger preference.
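The conversion step can be sketched as below. The function name `ratings_to_pairs` and the `rating_gap` field are hypothetical; the point is that storing the gap alongside each pair keeps the information that a bare (chosen, rejected) label would discard.

```python
from itertools import combinations


def ratings_to_pairs(prompt, rated_responses, min_gap=1):
    """Turn [(response, rating), ...] into preference pairs, keeping the gap."""
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(rated_responses, 2):
        gap = abs(score_a - score_b)
        if gap == 0 or gap < min_gap:
            continue  # tied or too close to call: no pair emitted
        chosen, rejected = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
        pairs.append(
            {"prompt": prompt, "chosen": chosen, "rejected": rejected, "rating_gap": gap}
        )
    return pairs


pairs = ratings_to_pairs(
    "Summarise the article.",
    [("Response A", 5), ("Response B", 4), ("Response C", 1)],
)
```

Three rated responses produce three pairs here, with gaps of 1, 4, and 3. A downstream step can then weight or filter on `rating_gap` instead of treating all pairs alike.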

3. AI feedback (RLAIF)

A capable LLM (often a larger version of the model being trained, or a separate judge model) provides preference labels instead of humans. Lee et al. (2023) showed that RLAIF performs comparably to RLHF on summarisation and helpful-chat tasks while dramatically cutting cost. The risk is circularity: a biased judge model bakes its own biases into the training set.
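One inexpensive guard against a single kind of judge bias, sensitivity to the order in which responses are shown, is to ask twice with the order swapped and keep only consistent verdicts. The sketch below assumes a hypothetical `judge(prompt, a, b)` callable that wraps a model call and returns "A" or "B"; it is not any particular library's API, and it does nothing about other biases the judge may have.

```python
from typing import Callable, Optional


def ai_preference(prompt: str, resp_1: str, resp_2: str,
                  judge: Callable[[str, str, str], str]) -> Optional[dict]:
    """Label a pair with a judge, keeping it only if the verdict
    survives swapping the presentation order.

    `judge(prompt, a, b)` is a hypothetical wrapper around an LLM call that
    returns "A" if it prefers the first response shown and "B" otherwise.
    """
    first = judge(prompt, resp_1, resp_2)
    second = judge(prompt, resp_2, resp_1)  # same pair, order swapped
    if first == "A" and second == "B":
        chosen, rejected = resp_1, resp_2
    elif first == "B" and second == "A":
        chosen, rejected = resp_2, resp_1
    else:
        return None  # verdict depended on position; treat as unreliable
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "label_source": "ai"}


# Stand-in judges for demonstration only; a real judge would call a model.
def prefers_shorter(prompt, a, b):
    return "A" if len(a) <= len(b) else "B"


def always_first(prompt, a, b):
    return "A"


consistent = ai_preference("Define entropy.", "Short answer.", "A much longer answer.", prefers_shorter)
unstable = ai_preference("Define entropy.", "Short answer.", "A much longer answer.", always_first)
```

The second stand-in judge always picks whichever response it sees first, so its label is discarded rather than written into the dataset.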

4. Constitutional / rule-based filtering

Anthropic's Constitutional AI approach generates a large set of response pairs, then uses a language model to critique each response against a fixed set of principles (the "constitution") and selects the pair member that better satisfies those principles. This scales cheaply but encodes the constitution author's values, not necessarily a broader human consensus.

Anatomy of a quality-controlled annotation pipeline

A production pipeline has six stages:

| Stage | What happens | Key quality lever |
|---|---|---|
| Prompt sampling | Sample diverse prompts from the target distribution | Avoid overfitting to one task type |
| Response generation | Sample K responses per prompt (K = 2-4 is typical) | Temperature and sampling diversity |
| Annotator assignment | Assign each pair to N annotators (N >= 2) | Enables inter-annotator agreement (IAA) measurement |
| Label collection | Annotators choose the preferred response and note the criteria | Structured rubric reduces ambiguity |
| IAA filtering | Discard or flag pairs with low agreement | Cohen's kappa > 0.4 is a common threshold |
| Deduplication | Remove near-duplicate prompts | Avoids reward hacking on repeated patterns |

Llama 2's team reported that for the most safety-sensitive categories, they required higher annotator agreement before including a pair in training, effectively implementing variable-IAA thresholds by domain.
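For the two-annotator case, Cohen's kappa can be computed directly from the label lists. The sketch below is a plain-Python illustration with made-up votes; with more than two annotators a multi-rater statistic such as Fleiss' kappa would be the usual substitute.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Each entry is the index (0 or 1) of the response that annotator preferred.
annotator_1 = [0, 0, 1, 1, 0, 1, 0, 0]
annotator_2 = [0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 6 of 8 items (0.75 raw agreement), but kappa is only about 0.47 once chance agreement is subtracted, which is why kappa rather than raw agreement is the usual filter.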

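A compact Python sketch of the pairing logic (the threshold value is illustrative):

from collections import Counter

AGREEMENT_THRESHOLD = 0.66  # illustrative: minimum vote share the winner needs

def build_preference_pair(prompt, responses, annotator_votes):
    # responses: list of strings; annotator_votes: one preferred index per annotator
    if len(responses) < 2 or not annotator_votes:
        return None  # nothing to compare, or no labels collected
    counts = Counter(annotator_votes)
    chosen_idx, top_votes = counts.most_common(1)[0]
    if top_votes / len(annotator_votes) < AGREEMENT_THRESHOLD:
        return None  # discard low-agreement pair
    # Reject the response with the fewest votes (ties go to the lowest index).
    others = [i for i in range(len(responses)) if i != chosen_idx]
    rejected_idx = min(others, key=lambda i: counts[i])
    return {
        "prompt": prompt,
        "chosen": responses[chosen_idx],
        "rejected": responses[rejected_idx],
    }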

Prompt distribution: the invisible design choice

The prompts you collect comparisons on define what your reward model (and ultimately your policy) cares about. Biases here are harder to detect than annotation errors because they never show up in inter-annotator agreement statistics.

Common prompt-distribution failure modes:

  • Task skew: if 60% of prompts are creative writing, the reward model will be a poor judge of factual accuracy in coding tasks.
  • Difficulty concentration: easy prompts where both responses are reasonable produce near-coin-flip labels. Informative pairs come from prompts where the model's outputs genuinely differ in quality.
  • Distribution shift: if prompts are sampled from the current model's usage logs, the dataset goes stale as the policy improves. Iterative data collection (weekly cadence, as Anthropic did in Bai et al., 2022) addresses this.

The practical recommendation: sample prompts from a broad template library, then filter to prompts where initial response quality is mixed (one strong, one weak) rather than uniformly good or bad.
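Task skew is easy to measure before any annotation budget is spent. The sketch below assumes each prompt carries a `task` tag from your own taxonomy; the tag names and the 0.4 cutoff are illustrative assumptions, not recommended values.

```python
from collections import Counter


def task_shares(prompts):
    """Share of the prompt pool taken by each task type."""
    counts = Counter(p["task"] for p in prompts)
    total = sum(counts.values())
    return {task: n / total for task, n in counts.items()}


def overrepresented(prompts, max_share=0.4):
    """Task types whose share exceeds `max_share` (an illustrative cutoff)."""
    return sorted(t for t, share in task_shares(prompts).items() if share > max_share)


# Toy pool: the "task" tag is assumed to come from your own prompt taxonomy.
pool = (
    [{"text": f"story {i}", "task": "creative_writing"} for i in range(6)]
    + [{"text": f"bug {i}", "task": "coding"} for i in range(3)]
    + [{"text": "capital of France?", "task": "factual_qa"}]
)
shares = task_shares(pool)
skewed = overrepresented(pool)
```

In this toy pool creative writing takes 60% of the prompts and is the only task flagged, matching the skew scenario described above.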

Chosen-rejected margin and what it buys you

Not all preference pairs carry equal training signal. A pair where one response is factually wrong and the other is correct is much more informative than a pair where both responses are good but one is marginally more concise. Some practitioners explicitly track the quality margin, typically measured as the difference in gold-standard scores when available.

Training on low-margin pairs can degrade reward model calibration: the model learns small stylistic distinctions that do not generalise. Filtering to high-confidence, high-margin pairs often produces a smaller but more effective dataset.
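A margin filter of this kind might look like the following. It is a sketch under assumptions: the `chosen_score`, `rejected_score`, and `agreement` fields are hypothetical names for gold-standard scores and annotator agreement, and both thresholds would need tuning on real data.

```python
def filter_by_margin(pairs, min_margin=1.0, min_agreement=0.75):
    """Keep pairs that are both clearly separated and clearly agreed on."""
    kept = []
    for pair in pairs:
        margin = pair["chosen_score"] - pair["rejected_score"]
        if margin >= min_margin and pair["agreement"] >= min_agreement:
            kept.append({**pair, "margin": margin})  # copy, with the margin attached
    return kept


candidates = [
    {"prompt": "p1", "chosen_score": 4.5, "rejected_score": 1.0, "agreement": 1.0},
    {"prompt": "p2", "chosen_score": 4.0, "rejected_score": 3.8, "agreement": 1.0},  # low margin
    {"prompt": "p3", "chosen_score": 5.0, "rejected_score": 2.0, "agreement": 0.5},  # low agreement
]
high_signal = filter_by_margin(candidates)
```

Only the first candidate survives. Logging how many pairs each threshold removes is a cheap way to see whether the filter is trimming noise or gutting the dataset.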

DPO and related methods are especially sensitive to margin quality because they optimise the policy directly on the log-probability ratio of chosen vs. rejected responses (measured relative to a reference model), with no separate reward model to absorb noise.
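For reference, the DPO objective as given by Rafailov et al. (2023) makes this dependence explicit. Here $x$ is the prompt, $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\sigma$ is the logistic function, and $\beta$ is a scaling hyperparameter:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Every pair in $\mathcal{D}$ pushes probability mass toward $y_w$ and away from $y_l$, so a mislabelled or near-tied pair produces a gradient on the policy itself rather than on an intermediate reward model.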

When it falls down

Annotator preference != task quality. Annotators systematically favour longer, more confident-sounding responses even when shorter ones are more accurate. This is sometimes called verbosity bias and has been observed in both human and LLM annotators. A reward model trained on such data learns to be verbose, not correct.
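A quick diagnostic for this bias is the fraction of pairs in which the chosen response is the longer one. The sketch below uses whitespace token counts as a crude length proxy (an assumption; a tokenizer-based count would be more faithful) and toy strings in place of real responses.

```python
def longer_chosen_rate(pairs):
    """Fraction of pairs whose chosen response has more whitespace-separated
    tokens than the rejected one."""
    if not pairs:
        raise ValueError("no pairs to analyse")
    longer = sum(len(p["chosen"].split()) > len(p["rejected"].split()) for p in pairs)
    return longer / len(pairs)


sample = [
    {"chosen": "a b c d", "rejected": "a b"},
    {"chosen": "a b c", "rejected": "a"},
    {"chosen": "a", "rejected": "a b c"},
    {"chosen": "a b c d e", "rejected": "a b c"},
]
rate = longer_chosen_rate(sample)
```

A rate well above 0.5 does not prove bias, since longer answers are sometimes genuinely better, but it is a signal to inspect a sample of pairs before training on them.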

Label noise compounds downstream. A 10% label error rate in the preference dataset can meaningfully degrade the reward model, which then provides a noisy training signal for RLHF, which can cause policy collapse during PPO. The errors do not average out; they propagate.

Gaming the rubric. If annotators are given explicit rubrics ("prefer the response that is more helpful"), models can overfit to surface features that score well on those rubrics without being genuinely helpful. This is one route to reward hacking.

Cold-start distribution mismatch. If responses are generated by a weak base model, both sides of most pairs may be poor, so the labels only separate bad from worse. Adding human-written "gold" responses as the chosen side for a fraction of pairs can raise the quality ceiling; InstructGPT relied on similar labeller-written demonstrations for its SFT data.

Scale does not substitute for coverage. 100,000 pairs concentrated on a narrow prompt distribution produce a worse reward model than 10,000 pairs that cover the actual task space well. Coverage and diversity matter more than raw count once you have passed a minimum threshold (roughly in the thousands for single-domain tasks).
