Textbook-Quality Synthetic Data

Microsoft's phi-1 model reached 50.6% on HumanEval in 2023 with 1.3 billion parameters, a score that competing models 10x its size could not beat. The team did not use a secret architecture. They changed one thing: instead of training on raw web text, they generated roughly one billion tokens of synthetic "textbook-quality" Python exercises using GPT-3.5, then added 6 billion tokens of carefully filtered real code. That result reframed the question from "how much data?" to "what kind of data?"

This concept unpacks the methods used to manufacture that kind of data, why they work, and where each one breaks.

What makes training data "textbook quality"?

Web text is an accurate mirror of human writing: repetitive, off-topic, inconsistent in rigour. A textbook is something different. It introduces a concept, provides a worked example, revisits the concept at a higher level, and uses precise language. When a language model trains on textbook text, it sees many distinct facets of the same idea rather than many superficially different surface forms of the same noise.

Operationally, "textbook quality" means:

Property	Web text (typical)	Textbook-quality
Density of distinct concepts	Low	High
Instructional clarity	Incidental	Deliberate
Worked examples	Rare	Expected
Logical progression	None	Structured
Noise (ads, boilerplate)	High	Removed

The key insight from the phi-1 work is that this quality can be manufactured. Given a base model capable enough to follow detailed prompts, you can request synthetic text with a specific pedagogical structure rather than hoping that structure appears naturally in a web crawl.

Instruction synthesis and distillation

Self-Instruct (Wang et al., ACL 2023) showed that a model can bootstrap its own instruction-following data. Starting from a small seed of 175 hand-written instruction-output pairs, a frozen language model generates thousands more by:

Sampling a few seed tasks as in-context examples.
Asking the model to produce a new, distinct task.
Generating an input and output for that task.
Filtering near-duplicates (ROUGE overlap > 0.7) and format violations.

The resulting pipeline produces roughly 52,000 diverse instruction pairs from a base model that could not itself follow instructions reliably. This is the paradox at the centre of instruction synthesis: the generator only needs to be better than random at producing valid examples; the student model, trained on the aggregate, averages out individual noise.

Distillation is a sharper variant. Here a stronger "teacher" model (often a frontier API) produces chain-of-thought solutions, explanations, or critiques. The student trains to reproduce the output, implicitly absorbing the teacher's reasoning style. Orca (Mukherjee et al., 2023) demonstrated this explicitly: a 13B model trained on GPT-4's step-by-step explanations substantially outperformed models trained on GPT-4's final answers alone. The lesson is that the intermediate reasoning trace is itself high-value training data, not just a means to an answer.

A simplified distillation pipeline:

for prompt in task_pool:
    trace = teacher.generate(prompt, cot=True)   # full reasoning
    if verifier.passes(trace):
        dataset.add((prompt, trace))
student.finetune(dataset)

The verifier can be a rule-based checker (for code: does it compile and pass tests?) or another model acting as a judge.

Rejection sampling

Rejection sampling sits between distillation and reinforcement learning. The generator produces multiple candidate outputs for each prompt; a separate scoring function discards low-quality candidates and keeps only the top-k.

Formally, if the generator is p_θ(y|x) and the scorer assigns reward r(x, y), the filtered distribution is:

p̃(y|x) ∝ p_θ(y|x) · 1[r(x, y) ≥ τ]

where τ is a quality threshold. Fine-tuning on samples from p̃ shifts the model toward the accepted region of output space, often called best-of-n fine-tuning.

The Llama 2 technical report describes using rejection sampling for its RLHF pipeline: the model generates many candidate responses, a reward model scores them, and only the highest-scoring responses are used as supervised fine-tuning targets before each round of PPO. Rejection sampling acts as a cheap, stable bootstrapping step that conditions the policy closer to the reward model's preferred region before the more fragile RL update begins.

The quality of everything downstream depends entirely on the scorer. A reward model that conflates verbosity with quality will systematically filter toward long, padded answers. A code verifier that only checks for compilation will filter toward syntactically valid but semantically wrong programs.

The constitutional loop

Constitutional AI (Bai et al., Anthropic 2022) extends the self-critique idea into a full generative pipeline. The core insight is that a model can act simultaneously as policy, critic, and reviser.

The process in its supervised phase:

Elicit a harmful response. Prompt the model with a red-team question.
Self-critique. Prompt the same model to identify what is wrong with its answer, citing a principle from a written "constitution" (e.g., "identify ways this is dishonest or harmful").
Revise. Prompt the model to rewrite the response in accordance with the critique.
Iterate. Repeat critique-and-revise several times.
Fine-tune on (original prompt, final revised response) pairs.

In the RL phase, the model generates pairs of responses; a separate "AI feedback" model scores which is preferable according to the same constitution; this preference signal trains a reward model used for RLHF.

The constitutional loop produces a harm-avoidance dataset with almost no human annotation of individual examples. The specification moves from per-label human effort to one-time principle-writing, which scales much more easily. The risk is that the constitution itself encodes the authors' blind spots, and a model that generates both critiques and revisions may converge on a narrow mode of what "acceptable" looks like rather than genuinely diverse safe behaviour.

When it falls down

Mode collapse and distribution narrowing. If the generator and the scorer share a model family, the accepted samples may all share the same stylistic signature. Fine-tuning on these then shifts the student toward that style, which makes future generations even more uniform. Alemohammad et al. (2023) formalised this as "Model Autophagy Disorder": self-consuming loops without a continuous injection of real data see quality or diversity progressively degrade across generations.

Reward hacking. Rejection sampling is only as good as the scoring function. A code verifier that accepts any program passing unit tests will reward solutions that hard-code test outputs. A reward model trained on human preferences will be gamed by responses optimised to look good rather than be good.

Coverage gaps. Synthetic pipelines are conditioned on the generator's world model. Rare but important distributions (unusual edge cases, minority languages, domain-specific jargon) are systematically under-represented because the generator produces them infrequently. The final model may appear broadly capable while failing on exactly the cases it has never been taught to handle.

Hallucination laundering. A teacher model producing chain-of-thought traces can generate confident but wrong reasoning. If the verifier cannot check correctness (e.g., for open-domain factual questions), the student learns to reproduce confident-sounding errors. This is not a corner case: it is the default outcome whenever the scorer is cheaper than ground-truth verification.

Costs are not zero. Generating one billion tokens at GPT-3.5 pricing costs on the order of $500-2,000 depending on the year and rate. At GPT-4 rates, the same corpus is 10-30x more expensive. These are not prohibitive, but they mean that "use a stronger teacher to generate better data" is not a free operation; the quality-cost tradeoff must be modelled explicitly.

What makes training data "textbook quality"?

Instruction synthesis and distillation

Rejection sampling

The constitutional loop

When it falls down

Further reading