Synthetic Preference Data

Getting a model to prefer helpful, honest, and harmless responses over harmful ones requires a reward signal. The traditional path, collecting tens of thousands of human pairwise comparisons, costs roughly $0.50 to $5 per comparison at scale, takes weeks to run, and produces labels that vary by annotator mood, culture, and attention span. By 2022, several teams had independently reached the same question: could a capable language model substitute for at least some of those human judgements? The answer is yes, with important caveats, and the machinery for doing it reliably is now a core part of modern alignment pipelines.

What makes a preference label "synthetic"

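In standard RLHF, a human is shown a prompt and two model completions (A and B) and picks the better one. That choice trains a reward model, which then guides policy optimisation, typically via PPO. Simpler contrastive losses such as DPO skip the separate reward model and optimise the policy directly on the preference pairs.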

Synthetic preference data replaces the human in that loop with a language model. The judge model reads the same prompt and both completions, then outputs either a scalar score or an explicit verdict ("A is better because..."). The resulting (prompt, chosen, rejected) triples are structurally identical to human-labelled data and can feed directly into the same reward-model training pipeline.

Three broad strategies exist:

Strategy	How labels are produced	Typical use
Pointwise scoring	Judge assigns a score to each completion independently; higher score becomes "chosen"	Fast, parallelisable; less sensitive to subtle differences
Pairwise comparison	Judge sees both completions and picks one	Closer to human annotation setup; more reliable for fine-grained quality
LLM-as-jury	Multiple judge models vote; majority wins	Reduces single-model bias; expensive
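The pointwise and jury strategies in the table reduce to small aggregation rules. The sketch below is illustrative only: the function names, the tie handling, and the "A"/"B" vote encoding are assumptions, not part of any particular library.

```python
from collections import Counter
from typing import Optional, Sequence


def pointwise_label(score_a: float, score_b: float) -> Optional[str]:
    """Turn two independent judge scores into a verdict; equal scores yield no label."""
    if score_a == score_b:
        return None  # assumption: ties are dropped rather than broken at random
    return "A" if score_a > score_b else "B"


def jury_verdict(votes: Sequence[str]) -> Optional[str]:
    """Majority vote over verdicts from several judge models; a tied vote yields None."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority
    return ranked[0][0]
```

Dropping ties loses some data but avoids injecting coin-flip labels into the preference set; an odd number of jurors makes ties impossible for two-way verdicts.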

The pairwise setup is most common. A typical prompt to the judge looks like:

System: You are an impartial evaluator. Given a user request and two
responses, decide which response better satisfies the request.
Respond with exactly "Response A" or "Response B".

User request: {prompt}
Response A: {completion_a}
Response B: {completion_b}

Adding chain-of-thought instructions ("First reason step by step, then give your verdict") measurably improves label quality at the cost of slower inference.
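A minimal sketch of how the judge prompt above might be filled in and its output parsed into a (prompt, chosen, rejected) triple. The helper names are hypothetical, and the actual call to the judge model is deliberately left out. The parser takes the last "Response A/B" mention so that any step-by-step reasoning printed before the verdict is skipped.

```python
import re
from typing import Dict, Optional

# Same wording as the judge prompt shown above.
JUDGE_TEMPLATE = (
    "System: You are an impartial evaluator. Given a user request and two\n"
    "responses, decide which response better satisfies the request.\n"
    'Respond with exactly "Response A" or "Response B".\n'
    "\n"
    "User request: {prompt}\n"
    "Response A: {completion_a}\n"
    "Response B: {completion_b}\n"
)

_VERDICT = re.compile(r"Response\s+([AB])\b")


def build_judge_prompt(prompt: str, completion_a: str, completion_b: str) -> str:
    """Fill the template with one prompt and its two candidate completions."""
    return JUDGE_TEMPLATE.format(
        prompt=prompt, completion_a=completion_a, completion_b=completion_b
    )


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return 'A' or 'B' from the judge's text, or None if no verdict is found."""
    matches = _VERDICT.findall(judge_output)
    return matches[-1] if matches else None


def to_triple(prompt: str, completion_a: str, completion_b: str,
              verdict: Optional[str]) -> Optional[Dict[str, str]]:
    """Convert a verdict into a preference triple; unparseable verdicts are dropped."""
    if verdict == "A":
        return {"prompt": prompt, "chosen": completion_a, "rejected": completion_b}
    if verdict == "B":
        return {"prompt": prompt, "chosen": completion_b, "rejected": completion_a}
    return None
```

Returning None for unparseable output, rather than guessing, keeps malformed judge responses out of the training set.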

The RLAIF pipeline in practice

RLAIF (Reinforcement Learning from AI Feedback) simply denotes RLHF where the feedback comes from a model rather than a human. The full pipeline runs roughly as follows:

1. Sample completions. For each prompt, sample two or more completions from the current policy (or a mixture of policies at different checkpoints).
2. Score with the judge. Pass each (prompt, completion_a, completion_b) triple to the judge LLM. Collect the verdict and, if requested, the rationale.
3. Train or update the reward model. Use the labelled pairs to train a Bradley-Terry reward model with the standard binary cross-entropy loss. To skip reward-model training entirely, "direct RLAIF" queries the judge for reward values during RL itself, though this is expensive.
4. Run RL. Use the reward model (or direct rewards) with a KL penalty anchored to the supervised fine-tuning (SFT) checkpoint.
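The two numerical pieces of this pipeline, the Bradley-Terry loss and the KL-penalised reward, can be sketched in plain Python. This is a per-example illustration under stated assumptions, not a training loop: the rewards and log-probabilities are taken as given scalars, and the default `beta` is an arbitrary placeholder.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Binary cross-entropy for one pair: -log(sigmoid(r_chosen - r_rejected)).

    Written as a numerically stable softplus of the negative margin.
    """
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_shaped_reward(reward: float, logp_policy: float, logp_sft: float,
                     beta: float = 0.1) -> float:
    """Reward minus a KL-style penalty on a single sampled completion.

    logp_policy - logp_sft is the single-sample estimate of the KL term;
    beta (placeholder value) controls how strongly the policy is held
    close to the SFT checkpoint.
    """
    return reward - beta * (logp_policy - logp_sft)
```

The loss is log 2 when the reward model scores both completions equally and shrinks towards zero as the chosen completion's margin grows.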

Google's RLAIF paper (Lee et al., 2023) reported that on summarisation and helpful-dialogue tasks, policies trained on AI-labelled feedback were rated on par with policies trained on human-labelled feedback, and that on harmless dialogue generation the AI-feedback policy was rated higher.

Constitutional AI: closing the loop with principles

Anthropic's Constitutional AI work (Bai et al., 2022) pushed this further by making the judge's criteria explicit. Instead of vague "helpfulness and harmlessness" instructions, a written constitution enumerates specific principles: avoid supporting illegal activity, do not demean individuals, prefer the most helpful response, and so on.

The training has two phases:

Supervised phase (SL-CAI): The model critiques its own harmful response against a randomly sampled constitutional principle, then revises it. The model is then fine-tuned on the revised responses. This phase uses no human harmlessness labels, only the model's own critiques guided by the constitution.

RL phase (RL-CAI / RLAIF): The model is asked to compare two responses and select the one that better respects a given principle. These AI-generated pairwise labels train a preference model (PM), which then supplies the reward signal for RL in the normal way.

The constitutional loop is significant because it injects explicit, auditable human values into the preference data generation process, rather than hoping the judge internalises them from pre-training.
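The supervised critique-and-revise step can be sketched as follows. Everything here is an assumption made for illustration: `generate` stands in for whatever function calls the model, the principle strings paraphrase the examples given above rather than quoting the actual constitution, and the prompt wording is invented.

```python
import random
from typing import Callable, Dict, Optional, Sequence

# Paraphrased examples only; not the text of any published constitution.
PRINCIPLES = [
    "Avoid supporting illegal activity.",
    "Do not demean individuals.",
    "Prefer the most helpful response.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical: text in, model completion out
    principles: Sequence[str] = PRINCIPLES,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Draft a response, critique it against one sampled principle, then revise it."""
    rng = rng or random.Random()
    principle = rng.choice(list(principles))
    original = generate(prompt)
    critique = generate(
        f"Request: {prompt}\nResponse: {original}\n"
        f"Critique the response against this principle: {principle}"
    )
    revised = generate(
        f"Request: {prompt}\nResponse: {original}\nCritique: {critique}\n"
        "Rewrite the response so that it addresses the critique."
    )
    return {
        "prompt": prompt,
        "principle": principle,
        "original": original,
        "critique": critique,
        "revised": revised,
    }
```

Keeping the sampled principle in the output record makes each training example traceable to the rule that shaped it, which is the auditability the constitutional approach is after.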

Rejection sampling as a lightweight alternative

Before a full RLHF pass is practical, rejection sampling offers a simpler path to high-quality preference signal. The procedure:

1. For each prompt, sample k completions from the current policy.
2. Score all k completions with a reward model (or a judge LLM).
3. Keep only the highest-scoring completion; discard the rest.
4. Fine-tune on the surviving set (SFT on best-of-k).

This is sometimes called rejection sampling fine-tuning (RFT). It is less sample-efficient than PPO but far more stable, requires no RL infrastructure, and can be iterated: train on the filtered set, re-sample, filter again, repeat.
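The filtering step of this procedure fits in a few lines. In the sketch below, `sample` and `score` are hypothetical callables standing in for the policy and the reward model (or judge), and the default `k` is arbitrary; the fine-tuning step itself is out of scope.

```python
from typing import Callable, Dict, Iterable, List


def best_of_k(
    prompts: Iterable[str],
    sample: Callable[[str], str],        # hypothetical: draws one completion from the policy
    score: Callable[[str, str], float],  # hypothetical: reward model or judge score
    k: int = 4,
) -> List[Dict[str, str]]:
    """Build an SFT dataset by keeping only the top-scoring of k samples per prompt."""
    dataset = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(k)]
        best = max(candidates, key=lambda completion: score(prompt, completion))
        dataset.append({"prompt": prompt, "completion": best})
    return dataset
```

Iterating the method means fine-tuning on the returned dataset, swapping the updated policy in as `sample`, and calling `best_of_k` again.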

The Llama 2 paper (Touvron et al., 2023) used iterative rejection sampling alongside PPO in their RLHF pipeline. Self-Rewarding Language Models (Yuan et al., 2024) pushed this further: the policy itself generates both the candidate responses and the reward judgements in each DPO iteration, eliminating the separate judge model entirely.

When it falls down

Position bias. LLM judges show a systematic preference for whichever response appears first (or second) in the prompt, independent of quality. Studies have measured this effect flipping judge verdicts 10-30% of the time. Mitigation: query the judge twice with the order of A and B swapped, then either average the two scores or keep the verdict only when both orderings agree.
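One way to realise the order-swap mitigation is sketched below. The `judge` callable is hypothetical and is assumed to return "A" or "B" for the response in the first or second slot. Treating a disagreement between the two orderings as "no label" is a design choice made here; averaging scores is an alternative when the judge emits numbers.

```python
from typing import Callable, Optional


def debiased_verdict(
    judge: Callable[[str, str, str], str],  # hypothetical: (prompt, resp_a, resp_b) -> "A" or "B"
    prompt: str,
    first: str,
    second: str,
) -> Optional[str]:
    """Query the judge in both orders; return 'first', 'second', or None on disagreement."""
    forward = judge(prompt, first, second)   # `first` sits in slot A
    backward = judge(prompt, second, first)  # `first` sits in slot B
    if forward not in ("A", "B") or backward not in ("A", "B"):
        return None  # malformed judge output
    pick_forward = "first" if forward == "A" else "second"
    pick_backward = "second" if backward == "A" else "first"
    return pick_forward if pick_forward == pick_backward else None
```

A judge that always picks slot A regardless of content returns None for every pair under this scheme, so pure position bias produces no labels instead of wrong ones.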

Length bias. Judges, and the models fine-tuned on their labels, develop a preference for longer responses regardless of content quality. Reward hacking ensues: the policy learns to pad outputs, not improve reasoning. Mitigation: add an explicit length-normalisation penalty to the reward or instruct the judge to penalise unnecessary verbosity.
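A length penalty can be as simple as subtracting a term proportional to the tokens beyond some budget. The sketch below is one illustrative form; the function name, the token budget, and the penalty coefficient are all assumed placeholder values that would need tuning per task.

```python
def length_penalised_reward(
    reward: float,
    n_tokens: int,
    target_tokens: int = 256,  # assumed budget, not a recommended value
    alpha: float = 0.001,      # assumed penalty per excess token
) -> float:
    """Subtract a linear penalty for every token beyond the target length."""
    excess = max(0, n_tokens - target_tokens)
    return reward - alpha * excess
```

Responses at or under the budget keep their reward unchanged, so the penalty discourages padding without pushing the policy towards terse answers.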

Self-enhancement bias. A model used as judge will systematically prefer outputs resembling its own style and vocabulary. When the policy and judge share the same base weights (as in self-rewarding setups), this feedback loop can entrench idiosyncratic behaviours rather than improve actual quality.

Model Autophagy Disorder (MAD). When successive training generations rely entirely on synthetic data from previous generations, without injecting fresh real data, diversity degrades even if pointwise quality appears stable. Alemohammad et al. (2023) formalised this as a "self-consuming loop" and showed that model distributions contract over generations, dropping rare but valid outputs. Practical implication: synthetic preference pipelines need periodic re-grounding against real human labels or held-out human evaluations.
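Re-grounding can start with something as simple as tracking how often the judge agrees with a held-out set of human labels. The sketch below assumes both label lists are aligned by example and use the same encoding (for instance "A"/"B"); the function name is hypothetical.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of held-out comparisons where the judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        raise ValueError("need at least one labelled comparison")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

A falling agreement rate across training generations is one signal that the synthetic loop is drifting away from human preferences and needs fresh human data.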

Reward model overoptimisation. The KL penalty in PPO limits, but does not eliminate, the tendency for the policy to exploit blind spots in the reward model. Synthetic labels may introduce correlated errors that a diverse human workforce would not, making the reward model easier to game.

Capability ceiling. A judge model cannot reliably detect errors in domains where it is itself incompetent. Factual hallucinations in specialist medicine, subtle mathematical errors, and code bugs that pass surface plausibility checks all slip through. Human spot-checking in high-stakes domains remains necessary.