Self-Instruct

Before Self-Instruct, getting a model to follow diverse instructions reliably meant paying humans to write thousands of (instruction, response) pairs. The 2022 paper by Wang et al. changed the equation: starting from just 175 seed tasks written by humans, their pipeline used GPT-3 to generate roughly 52,000 new instructions, filtered them, and fine-tuned GPT-3 itself on the result. The tuned model improved on the original by 33 percentage points on Super-NaturalInstructions, putting it on par with InstructGPT-001. The cost was a fraction of a human-annotation campaign, and the lesson spread across the field within months.

The Bootstrapping Loop

The core idea is seductively simple: a capable-enough base model can write instructions it does not yet know how to follow well, and fine-tuning it on those instructions, paired with its own attempted responses, teaches it to follow instructions better.

The pipeline has four stages:

Stage	What happens
1. Seed pool	175 human-written tasks prime the instruction space
2. Instruction generation	Model samples 8 tasks from the pool, then writes a new one
3. Instance generation	For each new instruction, the model produces input-output pairs
4. Filtering	Near-duplicate and degenerate samples are removed; survivors enter the pool

The loop then repeats: the new pool feeds Stage 2 again, expanding diversity without additional human effort.

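A sketch of the loop (Stages 2 to 4) in Python; "model" and "passes_filter" are placeholders supplied by the caller:

import random

def bootstrap(seed_tasks, model, passes_filter, n_steps, k=8):
    pool = list(seed_tasks)                  # copy of the 175 human-written tasks
    for _ in range(n_steps):
        context = random.sample(pool, k)     # few-shot prompt drawn from the pool
        new_instruction = model.generate(context)
        instances = model.generate_instances(new_instruction)
        # The filter needs the pool to compare against existing instructions.
        if passes_filter(new_instruction, instances, pool):
            pool.append((new_instruction, instances))
    return pool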

The "passes_filter" step is doing real work: ROUGE-L similarity is computed against every existing instruction and the new one is dropped if the maximum similarity exceeds 0.7. This keeps the distribution broad rather than saturating a small subset of the task space.
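The similarity check described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes lowercase whitespace tokenisation and the F1 form of ROUGE-L, and the function names (lcs_length, rouge_l, is_novel) are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 over lowercase whitespace tokens (a simplifying assumption)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_novel(new_instruction, existing_instructions, threshold=0.7):
    """Drop the instruction if its similarity to any existing one exceeds the threshold."""
    return all(rouge_l(new_instruction, old) <= threshold
               for old in existing_instructions)


pool = ["Translate the sentence into French.",
        "Summarize the article in one sentence."]
near_duplicate = is_novel("Translate the sentence into German.", pool)
fresh_task = is_novel("Write a haiku about autumn rain.", pool)
```

Here the German-translation instruction shares four of five tokens with an existing one and is rejected, while the haiku instruction shares none and is kept. Note that the check looks only at instruction text; it says nothing about whether the generated outputs are right.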

Why the Filter Is Not Enough (and Why It Still Matters)

Diversity filtering solves the redundancy problem but not the quality problem. The model generates what it is biased toward. In the paper's own manual review of 200 sampled generations, 92% of instructions described a valid task, but only 58% of outputs were judged correct. Tasks requiring genuine world knowledge, precise numerical reasoning, or multi-step code execution are the ones most likely to be flawed, because the base model is good at surface-level instruction pattern-matching, not at producing correct outputs in hard domains.

This asymmetry is worth internalising: Self-Instruct improves instruction breadth more reliably than it improves instruction correctness. The distinction matters because a fine-tuned model trained on plausible-but-wrong answers learns confident-but-wrong behaviour.

What Stanford Alpaca Proved

In March 2023, Stanford's CRFM team ran the most influential stress-test of the method. They applied a simplified Self-Instruct pipeline to LLaMA-7B, using text-davinci-003 as the generator:

52,000 instruction-following demonstrations generated via API
Total data-generation cost: under $500
Fine-tuning on 8 A100s for three hours: under $100
In a blind pairwise comparison against text-davinci-003, judged by the project's own student authors, Alpaca won 90 comparisons to 89

A 7-billion-parameter open model, trained for less than the cost of a mid-range laptop, was judged comparable to a proprietary model trained on vastly more data and compute. The headline numbers spread quickly; the caveats (hallucination, toxicity, narrow evaluation) travelled more slowly.
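A quick back-of-the-envelope check shows how little the 90-to-89 result can carry on its own. The sketch below uses only the figures quoted above; the normal-approximation interval is an assumption chosen for simplicity, and it treats the comparisons as independent, which the actual evaluation may not satisfy.

```python
import math

wins_alpaca, wins_davinci = 90, 89           # figures quoted above
n = wins_alpaca + wins_davinci               # total pairwise comparisons
win_rate = wins_alpaca / n

# Normal-approximation 95% interval for a binomial proportion.
# Assumption: comparisons are independent draws.
stderr = math.sqrt(win_rate * (1 - win_rate) / n)
low = win_rate - 1.96 * stderr
high = win_rate + 1.96 * stderr

# Upper bounds on cost quoted above, in US dollars.
max_total_cost = 500 + 100

print(f"win rate: {win_rate:.3f}")
print(f"95% interval: [{low:.3f}, {high:.3f}]")
print(f"total cost: under ${max_total_cost}")
```

The interval runs from roughly 0.43 to 0.58, so the comparison is consistent with either model being modestly preferred. "Comparable on this evaluation" is the supportable reading; "equal" is not.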

The Distillation Reading

There is a subtlety in how Self-Instruct is framed versus how it behaves in practice. The original paper uses the base model as both generator and student. Stanford Alpaca uses a stronger model (text-davinci-003) to generate data and a weaker model (LLaMA-7B) to learn from it. That second variant is distillation, not pure self-improvement.

The distinction matters for three reasons:

Capability ceiling. A model cannot reliably self-improve beyond the capabilities of its generator. If the generator is the same model, the ceiling is set by what the model could already do zero-shot. If the generator is a stronger teacher, the ceiling rises to the teacher's competence.
Licence and legal exposure. Generating training data from a proprietary model API and using it to fine-tune a competing model raises terms-of-service and potential IP questions, as the Alpaca authors themselves noted.
Error propagation. Stronger teachers make better mistakes, in the sense that their errors are subtler and harder to filter mechanically. ROUGE-L is not sensitive to factual accuracy; a beautifully formatted wrong answer passes the filter easily.

When It Falls Down

Capability echo chamber. The generator cannot reliably produce correct outputs for tasks beyond its current competence. Fine-tuning on those outputs embeds the errors. Iterating this loop without external verification amplifies the original model's blind spots.

Surface-format overfitting. Models trained on Self-Instruct data often learn the format of instructions (answer in bullet points, start with "Sure, here is...") rather than the underlying task semantics. This is a plausible contributor to the formulaic, over-hedging style that became notorious in early chat models, though it is unlikely to be the only cause.

Diversity collapse at scale. Even with ROUGE-L filtering, repeated sampling from the same model tends to collapse toward the instruction types the model is most confident about. Tasks requiring rare or specialised knowledge become underrepresented, not because they are filtered out, but because the model never generates them with high confidence in the first place.
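One rough way to watch for this kind of collapse is to track how evenly instruction types are distributed as the pool grows. The sketch below is an illustration rather than anything from the original pipeline: it uses the first word of each instruction as a crude stand-in for task type (an assumption; a real analysis would parse the instruction), and the function names are hypothetical.

```python
import math
from collections import Counter


def leading_word(instruction):
    """Crude proxy for task type: the first word, lowercased (an assumption)."""
    words = instruction.strip().split()
    return words[0].lower().strip(".,:;!?") if words else ""


def type_entropy(instructions):
    """Shannon entropy, in bits, of the leading-word distribution."""
    counts = Counter(leading_word(text) for text in instructions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


broad = ["Translate this sentence.", "Summarize the report.",
         "Classify the review.", "Write a limerick."]
collapsed = ["Write a poem.", "Write a story.",
             "Write an email.", "Summarize the report."]

broad_bits = type_entropy(broad)
collapsed_bits = type_entropy(collapsed)
```

A falling entropy across iterations would suggest the generator is converging on a few favourite task types even though each individual instruction still passes the pairwise similarity filter.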

No verification signal. Self-Instruct has no mechanism to check whether generated answers are correct. Rejection sampling (keeping only outputs that pass some verifier) and constitutional loops (having the model critique and revise its own outputs) are later-stage patches that address this, but they are not part of the original pipeline.
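Rejection sampling is easiest to picture on a task where correctness can be checked mechanically. The following is a minimal sketch under that assumption: the instance layout (instruction, input, output) and the arithmetic verifier are hypothetical, and most real tasks have no verifier this cheap.

```python
def verify_arithmetic(instance):
    """Hypothetical verifier: recompute the sum and compare with the generated output."""
    a, b = instance["input"]
    return instance["output"] == a + b


def rejection_sample(instances, verifier):
    """Keep only the generated instances that the verifier accepts."""
    return [inst for inst in instances if verifier(inst)]


generated = [
    {"instruction": "Add the two numbers.", "input": (17, 25), "output": 42},
    # Plausible-looking but wrong: 128 + 256 is 384.
    {"instruction": "Add the two numbers.", "input": (128, 256), "output": 394},
    {"instruction": "Add the two numbers.", "input": (9, 9), "output": 18},
]

kept = rejection_sample(generated, verify_arithmetic)
```

The wrong answer is well formed and would pass any text-similarity filter; only the verifier removes it. The hard part in practice is finding verifiers for tasks that are not as easily checked as arithmetic.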

Amplified biases. The generator's social and factual biases propagate into the training set. If the base model has skewed views on demographic topics or consistently misstates certain facts, those patterns appear in thousands of generated instances and are then trained in more firmly.
