Why Synthetic Data

Annotating one million question-answer pairs by hand would take a team of ten experienced labellers roughly eight years at full-time pace. GPT-3.5 can produce the same volume in under two days at a fraction of the cost. That gap - between the cost of human labelling and the cost of model-generated labelling - is the entire business case for synthetic data.

The core problem synthetic data solves

Every supervised learning pipeline eventually hits the same wall: labelled data runs out before the model saturates. The pre-training corpus for a large language model can contain trillions of tokens scraped from the web, but fine-tuning for a specific capability (instruction following, step-by-step reasoning, domain Q&A) requires curated examples with explicit outputs. Those outputs are expensive to produce.

Three structural costs drive this:

Cost type	Example	Typical magnitude
Annotation labour	A human writes an answer per question	$0.10 - $2.00 per example
Expert knowledge	A doctor labels clinical reasoning traces	$5 - $50 per example
Tail coverage	Examples for rare failure modes are hard to source	Often zero examples available

Synthetic data attacks all three. A capable model generates answers, provides reasoning traces, and can be prompted specifically for edge cases and rare distributions. Self-Instruct (Wang et al., 2023) demonstrated this concretely: fine-tuning GPT-3 on instructions it had generated itself raised performance on Super-NaturalInstructions by 33 percentage points over the original model, putting it on par with InstructGPT-001 with minimal human annotation.

Four techniques for generating synthetic training data

Instruction synthesis. A seed set of human-written prompts is expanded by asking a model to generate novel instructions in the same style. The model produces both the instruction and the desired response. WizardLM's Evol-Instruct (Xu et al., 2023) pushed this further by iteratively rewriting instructions into progressively harder variants. The resulting model outperformed Alpaca, which was itself trained on machine-generated Self-Instruct-style data, in head-to-head evaluations.
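
The iterative rewriting idea can be sketched in a few lines. This is a minimal illustration, not the Evol-Instruct implementation: the two rewrite templates, the `evolve` function, and the `toy_generate` stand-in for a model call are all hypothetical names introduced here.

```python
from typing import Callable, List

# Hypothetical rewrite prompts; the prompts used by Evol-Instruct differ.
EVOLVE_TEMPLATES = [
    "Rewrite the instruction so it needs one more reasoning step:\n{instruction}",
    "Add one concrete constraint to the instruction:\n{instruction}",
]


def evolve(seeds: List[str], generate: Callable[[str], str], rounds: int = 2) -> List[str]:
    """Grow a pool of instructions by repeatedly rewriting the newest ones."""
    pool = list(seeds)
    frontier = list(seeds)
    seen = set(pool)
    for r in range(rounds):
        template = EVOLVE_TEMPLATES[r % len(EVOLVE_TEMPLATES)]
        next_frontier = []
        for instruction in frontier:
            candidate = generate(template.format(instruction=instruction)).strip()
            # Drop empty rewrites and exact duplicates.
            if candidate and candidate not in seen:
                seen.add(candidate)
                pool.append(candidate)
                next_frontier.append(candidate)
        frontier = next_frontier
    return pool


def toy_generate(prompt: str) -> str:
    """Stand-in for a model call: returns the instruction with an extra demand."""
    instruction = prompt.split("\n", 1)[1]
    return instruction + " Explain each step."


pool = evolve(["Sort a list of integers."], toy_generate, rounds=2)
```

In a real pipeline `generate` would wrap a model API, and the duplicate check would usually be replaced by a similarity filter, since exact-match deduplication misses near-copies.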

Knowledge distillation. A large, expensive "teacher" model labels unlabelled inputs; a smaller "student" model trains on those labels. The student never sees the teacher's weights - only its outputs. This is how the phi-1 model (Gunasekar et al., 2023) achieved strong code-generation results with only 1.3 billion parameters: it trained on web code filtered for "textbook quality" together with GPT-3.5-generated textbooks and exercises, rather than on raw web scrapes. The key insight is that a textbook-style explanation of a concept is informationally denser than a Stack Overflow snippet of the same concept.
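
As a rough sketch of the data flow, the teacher is treated below as a black-box function from input to label, and the student's training file is built purely from its outputs. The function names and the keyword-based `toy_teacher` are assumptions for illustration; a real teacher would be a model API call.

```python
import json
from typing import Callable, Dict, Iterable, List


def build_distillation_set(inputs: Iterable[str],
                           teacher: Callable[[str], str]) -> List[Dict[str, str]]:
    """Label each unlabelled input with the teacher's output."""
    records = []
    for text in inputs:
        # Only the teacher's output is stored; its weights are never touched.
        records.append({"input": text, "target": teacher(text)})
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialise records as one JSON object per line, a common fine-tuning format."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def toy_teacher(text: str) -> str:
    """Stand-in for an expensive teacher model."""
    return "positive" if "good" in text.lower() else "negative"


records = build_distillation_set(["Good service", "Slow delivery"], toy_teacher)
jsonl = to_jsonl(records)
```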

Rejection sampling. Generate many candidate outputs for each prompt, then keep only those that pass a quality filter. The filter can be a verifier model, a unit test, a formal checker, or a heuristic rule. For mathematical reasoning, a simple correctness check (does the final numeric answer match the ground truth?) lets you construct training sets of correct reasoning traces even when the base model's pass rate is low. If each sample has only a 20% chance of being correct, generating ten samples per problem and keeping the correct ones still covers most of the problem set: assuming independent samples, at least one of the ten is correct with probability 1 - 0.8^10, or about 89%. The surviving traces form a high-quality fine-tuning set.
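
The generate-then-filter loop can be sketched as follows. This is an illustrative toy, assuming a sampler that returns a reasoning trace plus a final answer; `rejection_sample` and `make_toy_sampler` are hypothetical names, and the toy sampler is deliberately deterministic (correct on every fifth call) so the behaviour is easy to follow.

```python
from typing import Callable, Dict, List, Tuple


def rejection_sample(problems: List[Tuple[str, int]],
                     sample: Callable[[str], Tuple[str, int]],
                     n_samples: int = 10) -> Dict[str, List[str]]:
    """Keep only reasoning traces whose final answer matches the ground truth."""
    kept: Dict[str, List[str]] = {}
    for question, truth in problems:
        traces = []
        for _ in range(n_samples):
            trace, answer = sample(question)
            if answer == truth:  # the quality filter: exact answer match
                traces.append(trace)
        if traces:  # problems with no correct sample contribute nothing
            kept[question] = traces
    return kept


def make_toy_sampler(period: int = 5) -> Callable[[str], Tuple[str, int]]:
    """Stand-in for a model: answers 'a+b' correctly on every `period`-th call."""
    calls = {"n": 0}

    def sample(question: str) -> Tuple[str, int]:
        calls["n"] += 1
        a, b = (int(x) for x in question.split("+"))
        answer = a + b if calls["n"] % period == 0 else a + b + 1
        return f"{a} + {b} = {answer}", answer

    return sample


kept = rejection_sample([("2+3", 5), ("10+7", 17)], make_toy_sampler(), n_samples=10)
```

An answer-match filter accepts traces that reach the right number through wrong reasoning, so a stricter verifier is often layered on top.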

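The constitutional loop (RLAIF). Anthropic's Constitutional AI paper (Bai et al., 2022) introduced a two-stage process built around a written list of principles ("the constitution"). In the supervised stage, the model critiques and revises its own outputs according to those principles and is fine-tuned on the revisions. In the reinforcement learning stage, a model compares pairs of responses against the constitution, a separate preference model is trained on those AI-generated comparisons, and the policy is optimised against it. This second stage is Reinforcement Learning from AI Feedback (RLAIF). No human labels individual responses for harmlessness; the human contribution there is writing the constitution, although the paper still relied on human feedback for helpfulness. The practical payoff is that you can align a model's behaviour at scale without a bottleneck on human raters, though the quality ceiling is bounded by how well the critique model understands the principles.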
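
A minimal sketch of the critique-and-revise step is shown below. It covers only the self-revision part of the process, not preference-model training or reinforcement learning. The prompt wording, the `critique_and_revise` function, and the rule-based `toy_model` are assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable, List, Tuple


def critique_and_revise(prompt: str,
                        model: Callable[[str], str],
                        principles: List[str]) -> Tuple[str, str]:
    """Return (initial, revised) responses after one critique pass per principle."""
    initial = model(f"Respond to the request.\nRequest: {prompt}")
    revised = initial
    for principle in principles:
        critique = model(
            f"Critique the response against this principle: {principle}\n"
            f"Response: {revised}"
        )
        revised = model(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {revised}"
        )
    return initial, revised


def toy_model(prompt: str) -> str:
    """Rule-based stand-in for a language model, for demonstration only."""
    if prompt.startswith("Respond"):
        return "That is a silly question. Restart the router."
    if prompt.startswith("Critique"):
        return "dismissive opening" if "silly" in prompt else "no issues"
    # Revision request: edit the response only if the critique flagged it.
    response = prompt.split("Response: ", 1)[1]
    if "dismissive opening" in prompt:
        response = response.replace("That is a silly question. ", "")
    return response


initial, revised = critique_and_revise("My wifi is down", toy_model, ["Be respectful."])
```

The (initial, revised) pairs are what a supervised fine-tuning step would consume; the revised response is the training target.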

Why synthetic data often works better than raw web data

Raw web text is noisy, inconsistently formatted, and rarely structured to teach a specific skill. Synthetic data can be generated to spec. The phi-1 result made this unusually legible: a 1.3B-parameter model trained on roughly seven billion tokens (about six billion of filtered "textbook quality" web code plus about one billion of GPT-3.5-generated textbooks and exercises) outperformed much larger models trained on far larger natural-code corpora. The proposed explanation is that synthetic educational text mirrors how concepts are taught rather than how they appear in production code - with explanations, build-up, and worked examples rather than raw, uncommented code.

A second advantage is coverage. Human-curated datasets reflect the biases of the curators: common cases are over-represented, rare failure modes are absent. A generative model can be explicitly prompted to produce examples from underrepresented categories, hard adversarial cases, or specific reasoning patterns the base model currently fails on.
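
One simple way to act on this is to count what the existing dataset contains and spend the generation budget where it is thin. The sketch below assumes each example carries a category label; `generation_budget`, `build_prompts`, and the prompt wording are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def generation_budget(labels: List[str],
                      categories: List[str],
                      target_per_category: int) -> Dict[str, int]:
    """Number of synthetic examples to request per category to reach the target."""
    counts = Counter(labels)
    return {c: max(0, target_per_category - counts[c]) for c in categories}


def build_prompts(budget: Dict[str, int]) -> List[str]:
    """One generation prompt per missing example (prompt wording is illustrative)."""
    return [
        f"Write one new training example for the category: {category}"
        for category, missing in budget.items()
        for _ in range(missing)
    ]


existing = ["common"] * 8 + ["rare"]
budget = generation_budget(existing, ["common", "rare", "adversarial"], target_per_category=5)
prompts = build_prompts(budget)
```

Categories absent from the dataset, such as `adversarial` here, receive the full target, which is the case human curation tends to miss.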

When it falls down

Distribution shift from the generator. If the teacher model has systematic errors or gaps, the student learns them too. A distillation pipeline that used GPT-3.5 in early 2023 to generate medical reasoning would have inherited its hallucination patterns. Rejection sampling cannot rescue a generator that falls below a minimum quality floor: if the base model never generates a correct answer for a hard problem class, rejection sampling yields nothing for that class.

Self-consuming loops and model collapse. When synthetic data is fed back into training iteratively, without sufficient injection of real data, the distribution narrows. The paper "Self-Consuming Generative Models Go MAD" (Alemohammad et al., 2023) showed that in autophagous loops, where each generation trains on the previous generation's outputs, quality (precision) or diversity (recall) degrades progressively unless each generation receives enough fresh real data. MAD stands for Model Autophagy Disorder, a deliberate analogy to mad cow disease, and the analogy is apt: the corruption is self-amplifying. In LLM terms, a model that trains heavily on its own outputs risks collapsing toward a high-probability narrow manifold, losing the ability to produce rare but correct responses.
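
The narrowing effect can be reproduced in a toy setting. The sketch below is not the paper's experiment: it assumes the "model" is a one-dimensional Gaussian that is refit each generation to a small sample, and the `fresh_fraction` parameter (a name introduced here) controls how much of each sample comes from the true distribution rather than from the previous model.

```python
import random
import statistics
from typing import List


def self_consuming_loop(generations: int = 300,
                        n: int = 20,
                        fresh_fraction: float = 0.0,
                        seed: int = 0) -> List[float]:
    """Refit a Gaussian each generation; return the fitted std per generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real data distribution
    history = [sigma]
    n_real = round(n * fresh_fraction)
    for _ in range(generations):
        # Synthetic samples come from the previous generation's model.
        samples = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        # Fresh samples come from the real distribution, N(0, 1).
        samples += [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


collapsed = self_consuming_loop(fresh_fraction=0.0)
anchored = self_consuming_loop(fresh_fraction=0.5)
```

With no fresh data, the fitted standard deviation drifts toward zero because each refit slightly underestimates the spread and the errors compound. With half of each sample drawn from the real distribution, the estimate stays anchored near the true value.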

Reward hacking in the constitutional loop. When the critique model and the policy model are architecturally similar, the policy can learn to game the critique rather than genuinely improving. The constitutional principles become a prompt to optimise against rather than a genuine constraint. This is the synthetic-data version of Goodhart's Law: the metric (critique score) diverges from the goal (actual harmlessness or helpfulness) once the metric is used as a training target.

Coverage gaps for rare capabilities. Instruction synthesis generates instructions that look like instructions already in the seed set. Truly novel capability domains - a new programming language released after training, a rare dialect, a specialised scientific subfield - are underrepresented or absent. The model generates what it knows how to generate, which means capability gaps are not filled by more synthetic data of the same type.

Evaluation contamination. Benchmarks used to measure progress can leak into synthetic data pipelines, either because the teacher was trained on benchmark-adjacent data or because synthetic generation prompts inadvertently reference benchmark formats. A model that scores well on a contaminated benchmark may not generalise.
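
A common first-line check is to look for long word n-grams shared between synthetic examples and benchmark items. The sketch below is a simplified version of that idea; the function names, the tokenisation rule, and the default of 8 tokens are assumptions, and exact n-gram overlap will not catch paraphrased leaks.

```python
import re
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(synthetic: List[str], benchmark: List[str], n: int = 8) -> List[int]:
    """Indices of synthetic examples sharing at least one n-gram with the benchmark."""
    bench_grams: Set[Tuple[str, ...]] = set()
    for item in benchmark:
        bench_grams |= ngrams(item, n)
    return [i for i, example in enumerate(synthetic) if ngrams(example, n) & bench_grams]


benchmark = ["What is the capital of France and when was it founded?"]
synthetic = [
    "Q: What is the capital of France and when was it founded? A: Paris.",
    "Name three sorting algorithms and compare their complexity.",
]
flagged = flag_contaminated(synthetic, benchmark)
```

Flagged examples would typically be dropped from the training set before fine-tuning.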

Further reading