Evol-Instruct

The original Alpaca dataset contained 52,000 instructions generated by OpenAI's text-davinci-003 with a Self-Instruct-style pipeline, each instruction produced in a single pass. Most of them were shallow: "Write a poem about spring", "Summarise this paragraph", "List five fruits". A model fine-tuned on that data behaved accordingly: fluent, but unreliable on a multi-step prompt that imposed several competing constraints at once. The problem was not quantity; it was the complexity distribution. Evol-Instruct was designed to fix that distribution.

What Evol-Instruct Does

Evol-Instruct (introduced with WizardLM, ICLR 2024) is a mutation pipeline: take a seed instruction, ask an LLM to rewrite it into a harder version, repeat for several rounds, and discard outputs that fail a quality filter. The result is a dataset whose complexity spans a wide spectrum rather than clustering at the trivial end.

The pipeline has three main components.

Depth evolution makes an existing instruction harder without changing its topic. The paper describes five operators applied stochastically:

Operator	What it does
Add constraints	Introduces additional conditions the response must satisfy
Deepening	Requires the model to engage with a more specialised sub-problem
Concretising	Replaces vague terms with specific technical ones
Increased reasoning	Demands multi-step logical inference rather than recall
Complicate input	Makes the input artefact itself more intricate (longer code, nested data)
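
The stochastic operator choice can be sketched as follows. This is a minimal illustration, not the paper's implementation: the operator keys, the prompt wording in `DEPTH_OPERATORS`, and the helper `build_depth_prompt` are all hypothetical, and the operator is assumed to be sampled uniformly.

```python
import random

# Hypothetical prompt fragments, one per depth operator in the table above.
# The wording is illustrative and is not copied from the WizardLM paper.
DEPTH_OPERATORS = {
    "add_constraints": "Add one more constraint or requirement to the instruction.",
    "deepening": "Make the instruction focus on a more specialised sub-problem.",
    "concretising": "Replace general concepts with more specific ones.",
    "increased_reasoning": "Rewrite it so that it explicitly requires multi-step reasoning.",
    "complicate_input": "Make the input data or code in the instruction more intricate.",
}


def build_depth_prompt(instruction: str, rng: random.Random) -> tuple[str, str]:
    """Sample one operator uniformly and return (operator_name, prompt_text)."""
    name = rng.choice(sorted(DEPTH_OPERATORS))
    prompt = (
        "Rewrite the instruction below into a more complex version.\n"
        f"Method: {DEPTH_OPERATORS[name]}\n"
        "Keep the topic unchanged and keep the result answerable.\n\n"
        f"Instruction:\n{instruction}\n"
    )
    return name, prompt


if __name__ == "__main__":
    operator, prompt = build_depth_prompt("List five fruits.", random.Random(0))
    print(operator)
    print(prompt)
```

Returning the operator name alongside the prompt makes it possible to log which mutation produced each evolved instruction, which helps when auditing the dataset later.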

Breadth evolution generates entirely new instructions on related but distinct topics, expanding subject-matter coverage. Each breadth step starts from an existing instruction and asks the LLM, in paraphrase, to draw inspiration from it and create a brand-new instruction in the same domain that is rarer than the original while keeping a similar length and difficulty.

Elimination filtering discards evolved outputs that failed to evolve usefully. The WizardLM paper treats a candidate as failed when: (a) it adds no information over its parent, as judged by a second LLM call; (b) the model struggles to answer it, detected by a short response containing "sorry"; (c) the response consists only of punctuation and stop words; or (d) the instruction copies wording from the rewriting prompt itself, such as "given prompt" or "rewritten prompt". A candidate that trips any criterion is discarded.

The full WizardLM V2 dataset reached roughly 196,000 evolved examples, mixing Alpaca seeds with ShareGPT conversations put through the same evolution loop.

Why Complexity Distribution Matters

Standard instruction tuning on a fixed dataset teaches a model to match the distribution of that dataset. If the dataset contains only simple instructions, the model generalises poorly to complex ones, not because the underlying weights lack capacity, but because training never produced gradient signal from hard examples.

Evol-Instruct forces the training distribution to cover harder regions by construction. You can think of the successive evolution rounds as curriculum generation: round 0 instructions are easy, round 3 or 4 instructions may impose five nested constraints. A model fine-tuned on all rounds learns to handle the full difficulty spectrum.

Formally, let \(D_0\) be the seed dataset. After \(k\) depth-evolution steps the dataset becomes \(D_k\), where each instruction \(x_i^{(k)}\) is derived from \(x_i^{(k-1)}\) by applying a randomly sampled mutation operator \(m\) via:

\[x_i^{(k)} = \text{LLM}(\text{prompt}_m, x_i^{(k-1)})\]

The final training set is the union \(\bigcup_{k=0}^{K} D_k\), so easy and hard examples coexist and the model is not forced to forget simpler skills.
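
The recurrence and the final union can be sketched in a few lines. This is a minimal sketch under stated assumptions: `evolve_dataset`, `training_union`, and `fake_llm` are hypothetical names, `fake_llm` stands in for a real model call so the loop runs offline, and the elimination filter is left out to keep the focus on how \(D_k\) is built from \(D_{k-1}\).

```python
import random
from typing import Callable


def evolve_dataset(
    seeds: list[str],
    rounds: int,
    prompts: dict[str, str],
    llm: Callable[[str, str], str],
    rng: random.Random,
) -> list[list[str]]:
    """Return [D_0, D_1, ..., D_K], where D_k[i] is evolved from D_{k-1}[i]."""
    generations = [list(seeds)]
    operator_names = sorted(prompts)
    for _ in range(rounds):
        previous = generations[-1]
        current = []
        for instruction in previous:
            m = rng.choice(operator_names)  # sample a mutation operator m
            # x_i^(k) = LLM(prompt_m, x_i^(k-1))
            current.append(llm(prompts[m], instruction))
        generations.append(current)
    return generations


def training_union(generations: list[list[str]]) -> list[str]:
    """Flatten D_0 ... D_K into one list, dropping exact duplicates."""
    seen = set()
    union = []
    for dataset in generations:
        for instruction in dataset:
            if instruction not in seen:
                seen.add(instruction)
                union.append(instruction)
    return union


def fake_llm(prompt: str, instruction: str) -> str:
    """Stand-in for a real model call: appends a marker to the instruction."""
    return f"{instruction} [+{prompt}]"


if __name__ == "__main__":
    gens = evolve_dataset(
        ["sort a list", "list five fruits"],
        rounds=3,
        prompts={"constrain": "c", "deepen": "d"},
        llm=fake_llm,
        rng=random.Random(0),
    )
    print(len(gens), len(training_union(gens)))
```

Keeping each round as its own list, rather than appending to one pool, preserves the round index of every example, which is useful if you later want to reweight easy and hard rounds.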

The Elimination Filter in Practice

Without filtering, LLMs evolving their own outputs introduce characteristic failure modes. The evolved instruction sometimes:

Repeats the original verbatim with a single adjective changed.
Contains meta-text like "Now I will make the instruction more complex by..."
Becomes incoherent or unanswerable (too many conflicting constraints stacked in one pass).
Produces an empty string when the model refuses an instruction it deems sensitive.

The elimination filter catches most of these by prompting a judge model with a short rubric. In the WizardLM experiments, filtering removed a meaningful fraction of evolved outputs at later rounds, which is itself informative: it suggests diminishing returns as instructions get harder to evolve coherently.
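
Some of the failure modes listed above are cheap to detect with string checks before any judge call is made. The sketch below is one possible pre-filter, not the WizardLM implementation: the marker phrases, the 0.85 similarity threshold, and the function name `prefilter` are assumptions to tune on real output. Incoherent or unanswerable instructions are not covered and would still need the judge.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical marker phrases and threshold; tune them on real pipeline output.
META_MARKERS = ("given prompt", "rewritten prompt", "now i will", "more complex by")
COPY_THRESHOLD = 0.85


def prefilter(parent: str, evolved: str) -> Optional[str]:
    """Return a rejection reason, or None if the candidate should go to the judge."""
    text = evolved.strip()
    if not text:
        return "empty"  # e.g. the model refused and returned nothing
    lowered = text.lower()
    if any(marker in lowered for marker in META_MARKERS):
        return "meta_text"  # leaked commentary from the rewriting prompt
    similarity = SequenceMatcher(None, parent.lower(), lowered).ratio()
    if similarity >= COPY_THRESHOLD:
        return "near_copy"  # parent repeated with only a token or two changed
    return None


if __name__ == "__main__":
    parent = "Summarise this paragraph about photosynthesis in plants."
    print(prefilter(parent, "Summarise this long paragraph about photosynthesis in plants."))
    print(prefilter(parent, "Now I will make the instruction more complex by adding steps."))
```

Running the string checks first means the judge model is only called on candidates that survive them, which reduces the number of judgment calls.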

The filtering step is also where cost concentrates. Each seed instruction may generate multiple evolution candidates across rounds, each requiring a separate API call for both generation and judgment. For large seed sets, this makes Evol-Instruct considerably more expensive than a single-pass generation method like Self-Instruct.
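
A back-of-the-envelope count makes the cost gap concrete. The helper `api_calls` below is a hypothetical sketch that assumes one surviving instruction per seed per round, no drop-outs between rounds, and one judge call per generated candidate; calls that generate the responses for training are excluded.

```python
def api_calls(
    seeds: int,
    rounds: int,
    candidates_per_round: int = 1,
    judged: bool = True,
) -> int:
    """Upper bound on LLM calls, assuming nothing is filtered out between rounds."""
    generation = seeds * rounds * candidates_per_round
    judgment = generation if judged else 0  # one judge call per candidate
    return generation + judgment


if __name__ == "__main__":
    # An Alpaca-sized seed set evolved for four rounds.
    print(api_calls(52_000, 4))                # generation plus judgment
    print(api_calls(52_000, 4, judged=False))  # generation only
```

Under these assumptions, 52,000 seeds evolved for four rounds need 208,000 generation calls and another 208,000 judgment calls, against roughly one call per instruction for a method that generates each instruction once.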

Extensions: WizardCoder and WizardMath

The Evol-Instruct principle transferred cleanly to specialised domains.

WizardCoder (ICLR 2024) adapted the mutation operators for code tasks. Depth operators were modified to increase algorithmic complexity, add edge cases, or require the code to operate under memory constraints. Starting from the Code Alpaca dataset (~20k code instructions), the evolved dataset improved HumanEval pass@1 scores substantially, briefly surpassing closed-source models including an earlier version of Claude on that benchmark.

WizardMath introduced Reinforcement Learning from Evol-Instruct Feedback (RLEIF). Rather than stopping at supervised fine-tuning on evolved data, the pipeline adds a process reward model that scores reasoning chains step by step, and the evolved instructions are used both for SFT and as prompts during RL. WizardMath-Mistral 7B reached competitive results on GSM8K and the harder MATH benchmark, matching models with significantly more parameters. The approach was accepted as an oral at ICLR 2025.

When It Falls Down

Distribution shift from the evolution LLM. The evolved instructions carry the stylistic fingerprint of whatever model generated them (ChatGPT, i.e. gpt-3.5-turbo, in the original WizardLM work). A student model fine-tuned on this data learns, in part, to mimic that generator's blind spots and refusal patterns, not just to follow complex instructions.

Quality collapse at high evolution depth. Beyond roughly four or five rounds, the elimination filter rejects an increasing fraction of outputs, and those that survive often have a peculiar character: grammatically complex but semantically fragile. The constraints accumulate faster than meaningful difficulty.
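
One way to see where depth stops paying off is to log the filter's decision per round and compute the rejection rate. The sketch below assumes a hypothetical log of `(round_index, accepted)` pairs; the sample records are made up for illustration and are not WizardLM results.

```python
from collections import Counter


def rejection_rate_by_round(records: list[tuple[int, bool]]) -> dict[int, float]:
    """Map each round index to the fraction of its candidates that were rejected."""
    total = Counter()
    rejected = Counter()
    for round_index, accepted in records:
        total[round_index] += 1
        if not accepted:
            rejected[round_index] += 1
    return {r: rejected[r] / total[r] for r in sorted(total)}


if __name__ == "__main__":
    # Synthetic log: (round, accepted by the elimination filter).
    log = [(1, True), (1, True), (1, True), (1, False), (2, True), (2, False)]
    print(rejection_rate_by_round(log))
```

A rate that climbs steeply from one round to the next is a practical signal to stop evolving, rather than fixing the number of rounds in advance.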

Homogenisation within a seed topic. Depth evolution does not escape the topic of the seed instruction. A seed about sorting algorithms generates a chain of harder sorting prompts, not a diverse curriculum across computer science. Breadth evolution is supposed to address this, but in practice the breadth-evolved instructions cluster around topics well-represented in the generator LLM's prior.

No ground-truth quality signal. The elimination filter is itself an LLM, so the quality judgments are as noisy as LLM outputs generally are. Systematic biases in the judge (for instance, preferring verbose instructions) propagate silently into the training set. Unlike rejection sampling with a verifiable reward signal (as in mathematical reasoning with symbolic verification), Evol-Instruct has no external oracle.
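
A verbosity bias of the kind mentioned above can at least be probed with a simple length comparison. This is a rough diagnostic under stated assumptions: `length_gap` is a hypothetical helper, word count is a crude proxy for verbosity, and a positive gap is not proof of bias, since genuinely harder instructions also tend to be longer.

```python
from statistics import mean


def length_gap(judged: list[tuple[str, bool]]) -> float:
    """Mean word count of accepted instructions minus that of rejected ones."""
    accepted = [len(text.split()) for text, ok in judged if ok]
    rejected = [len(text.split()) for text, ok in judged if not ok]
    if not accepted or not rejected:
        raise ValueError("need at least one accepted and one rejected example")
    return mean(accepted) - mean(rejected)


if __name__ == "__main__":
    sample = [
        ("Sort the list and explain each comparison step", True),
        ("Sort the list", False),
    ]
    print(length_gap(sample))
```

Comparing the gap on a small hand-labelled subset, where quality is known, against the gap on the judge's decisions gives a cheap check on whether the judge rewards length itself.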

Cascading errors in multi-round evolution. Each round conditions on the previous round's output. If an error or ambiguity enters at round 2, rounds 3 and 4 inherit it. The seed quality is therefore disproportionately important, and low-quality seeds can produce entire chains of plausibly-formatted but subtly broken training examples.
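
Recording the parent of every evolved example makes such chains removable once a fault is found. The sketch below is illustrative only: the `Evolved` record and the `tainted_ids` helper are hypothetical, and they assume each example stores the identifier of the instruction it was evolved from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evolved:
    uid: int
    parent_uid: Optional[int]  # None for seed instructions
    round: int
    text: str


def tainted_ids(examples: list[Evolved], bad_uid: int) -> set[int]:
    """Return bad_uid plus the ids of every example evolved from it, at any depth."""
    children: dict[int, list[int]] = {}
    for example in examples:
        if example.parent_uid is not None:
            children.setdefault(example.parent_uid, []).append(example.uid)
    tainted: set[int] = set()
    stack = [bad_uid]
    while stack:
        current = stack.pop()
        if current in tainted:
            continue
        tainted.add(current)
        stack.extend(children.get(current, []))
    return tainted


if __name__ == "__main__":
    chain = [
        Evolved(0, None, 0, "seed"),
        Evolved(1, 0, 1, "round 1"),
        Evolved(2, 1, 2, "round 2, ambiguous"),
        Evolved(3, 2, 3, "round 3"),
        Evolved(4, 0, 1, "breadth sibling"),
    ]
    print(sorted(tainted_ids(chain, 2)))
```

With lineage stored, a fault found at round 2 removes only that example and its descendants, while the seed and unrelated siblings stay in the training set.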
