Borrowed Intelligence: How Knowledge Distillation Builds Small Language Models That Punch Above Their Weight

In 2024, Google trained Gemma 2 at 2 billion and 9 billion parameters and made a quiet but radical choice: it did not train those two models on the usual diet of next-token prediction. Instead it fed them the full output distribution of a much larger teacher at every position, and ran that process across more than fifty times the number of tokens that scaling theory calls compute-optimal for a model that size (Gemma Team, 2024, Gemma 2: Improving Open Language Models at a Practical Size, arXiv:2408.00118). The resulting 9B model competes with open models two to three times larger. The trick was not a new attention variant or a clever positional scheme. It was distillation: the student learned to imitate a teacher's beliefs rather than rediscover the world from one-hot labels.

That inversion, training a small model against a large model's soft predictions instead of against ground truth alone, is the engine behind the current generation of small language models (SLMs). The phone in your pocket, the laptop running a coding assistant offline, the edge device summarizing documents without a network call: many of them owe their competence to a teacher they will never meet at inference time.

Why this matters: The dominant story about LLM progress is "bigger is better." Distillation is the counter-story, and it is the one that actually reaches users. It decides whether a capable model fits in 4 GB of RAM or demands a data-center GPU, and it is reshaping how labs spend their training compute.

TL;DR

Knowledge distillation trains a small student to match a large teacher's full probability distribution, not just the correct token. The soft distribution carries information that a one-hot label throws away.
The idea is old (Hinton, Vinyals, and Dean formalized it in 2015), but its application to generative language models forced two major revisions: sequence-level distillation and on-policy distillation.
Forward KL divergence, the textbook objective, makes a student try to cover every mode of the teacher and produces bland, hedging text. Reverse KL makes the student commit to the teacher's dominant modes, which matters for generation quality.
On-policy methods (the student learns from its own sampled outputs, graded by the teacher) close the train-inference gap that plagues naive distillation.
Distillation scaling laws (2025) now let teams predict student quality from a compute budget and decide whether distilling beats plain pretraining.
Phi-3-mini (3.8B) and Gemma 2 (2B, 9B) are the production proof: small models, distillation-shaped training, performance that embarrasses the parameter count.
Distillation is not free. A strong teacher can mislead a weak student (capacity gap), and a distilled model inherits its teacher's blind spots and biases.

At a Glance

The whole mechanism fits in one picture: a teacher emits a soft distribution over the vocabulary, the student emits its own, and a divergence loss pulls them together while a smaller cross-entropy term keeps the student honest against real labels.

flowchart LR
  X[Input tokens] --> T["Teacher model<br/>frozen, large"]
  X --> S["Student model<br/>trainable, small"]
  T --> PT[Soft distribution<br/>over vocabulary]
  S --> PS[Student distribution]
  PT --> L[Divergence loss<br/>KL teacher vs student]
  PS --> L
  X --> Y[Ground-truth token]
  Y --> CE[Cross-entropy loss]
  PS --> CE
  L --> G[Combined gradient]
  CE --> G
  G --> S
  class X,Y blue
  class T,PT purple
  class S,PS teal
  class L,CE,G amber
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff

[IMAGE: Side-by-side bar charts of a teacher's softmax over the vocabulary at one position. Left bar chart at temperature 1 shows one tall spike; right at temperature 4 shows several visible competing tokens. Annotate the "dark knowledge" the higher temperature reveals.]

Before Distillation Scaled Down

The compression instinct predates language models. In the mid-2000s, Bucila, Caruana, and Niculescu-Mizil showed you could compress an ensemble into a single small network by training it to mimic the ensemble's outputs. The idea sat mostly dormant until Geoffrey Hinton, Oriol Vinyals, and Jeff Dean gave it its modern framing in 2015 (Hinton et al., 2015, Distilling the Knowledge in a Neural Network, arXiv:1503.02531). Their reframing was conceptual, not just technical: a trained network's output probabilities encode "dark knowledge," the relative similarities between classes that a hard label erases. A model that labels an image "dog" also assigns small but meaningful probability to "wolf" and almost none to "airplane." Those ratios teach the student something the ground-truth label cannot.

For classification this was elegant. For language it was harder, because language generation is sequential and the space of outputs is combinatorial. The first serious adaptation came from Yoon Kim and Alexander Rush, who introduced sequence-level knowledge distillation for neural machine translation (Kim and Rush, 2016, Sequence-Level Knowledge Distillation, arXiv:1606.07947). Rather than match the teacher token by token against the reference, they let the teacher generate translations by beam search and trained the student on those teacher outputs. Their best student ran roughly ten times faster than the teacher with a BLEU drop of only 0.2 points.

Then BERT arrived, and with it the pressure to deploy. DistilBERT showed that a transformer encoder could be shrunk during pretraining: 40 percent fewer parameters, 60 percent faster inference, while retaining about 97 percent of BERT's language-understanding score (Sanh et al., 2019, DistilBERT, a distilled version of BERT, arXiv:1910.01108). That result moved distillation from a research curiosity into the standard deployment toolkit.

[IMAGE: A scaled diagram of BERT-base next to DistilBERT, twelve transformer layers shrinking to six, with callouts for "40% fewer params, 60% faster, ~97% of GLUE retained."]

The generative era exposed a flaw the encoder work could hide. Autoregressive models are trained on prefixes drawn from one distribution (the data, or the teacher's outputs) but at inference condition on prefixes drawn from another (their own previously generated tokens). Distilling them naively inherits this mismatch and amplifies it. The fixes for that mismatch (reverse KL, on-policy sampling) define the modern playbook.

timeline
  title Evolution of Knowledge Distillation
  2006 : Model compression : ensemble into one net
  2015 : Hinton soft targets : temperature and dark knowledge
  2016 : Sequence-level KD : teacher beam-search outputs
  2019 : DistilBERT : 40 percent smaller encoder
  2023 : On-policy distillation : MiniLLM and GKD
  2024 : Distilled SLMs ship : Gemma 2 and Phi-3
  2025 : Distillation scaling laws : compute-optimal recipes

How Distillation Actually Works

Strip away the variants and one equation sits at the center. A teacher with frozen parameters produces, at each token position, a probability distribution \(p\) over the vocabulary. The student, with trainable parameters \(\theta\), produces \(q_\theta\). Training minimizes a divergence between them, almost always a Kullback-Leibler divergence, usually blended with the ordinary cross-entropy against the true next token.

Soft targets and temperature

The teacher's raw logits \(z_i\) become probabilities through a temperature-scaled softmax (the student's logits are softened the same way to give \(q_{\theta,T}\)):

\[p_{T,i} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\]

At \(T = 1\) this is the normal softmax. Raise \(T\) and the distribution flattens, pulling up the small probabilities that carry the inter-token similarity structure. Hinton's insight was that training the student at the same elevated temperature exposes this structure; the gradient from a soft target with three or four plausible continuations is far richer than the gradient from a single one-hot vector. The student learns not just "the next word is bank" but "bank is likely, shore is plausible, airplane is absurd." That ranking is the borrowed intelligence.
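
To make the flattening concrete, here is a minimal Python sketch of the temperature-scaled softmax. The three logits and the token names attached to them are illustrative assumptions, not values from any real teacher.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the tokens: bank, shore, airplane
logits = [6.0, 3.0, -2.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)

for name, p1, p4 in zip(["bank", "shore", "airplane"], sharp, soft):
    print(f"{name:>9s}  T=1: {p1:.4f}  T=4: {p4:.4f}")
```

With these made-up logits, the top token holds over 95 percent of the mass at \(T = 1\) and a little over 60 percent at \(T = 4\), while the ranking of the three tokens is unchanged. The runner-up is what becomes visible to the student.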

The combined loss for the soft-target approach takes the shape

\[\mathcal{L} = \alpha \, T^2 \, D_{KL}(p_T \,\|\, q_{\theta,T}) + (1 - \alpha)\, \mathcal{L}_{CE}(y, q_\theta)\]

where \(p_T\) and \(q_{\theta,T}\) are the temperature-softened teacher and student distributions, \(y\) is the ground-truth token, and the \(T^2\) factor rescales the soft-target gradient so it stays comparable in magnitude to the hard-label term as temperature changes.
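
A minimal sketch of that blended objective for a single token position, in plain Python. The function names, the defaults \(\alpha = 0.5\) and \(T = 2\), and the list-of-floats representation are assumptions for illustration; a real trainer would compute this over batched tensors in a deep-learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, target_index,
                      alpha=0.5, temperature=2.0):
    """alpha * T^2 * KL(p_T || q_T) + (1 - alpha) * cross-entropy at T = 1."""
    p_T = softmax(teacher_logits, temperature)   # softened teacher
    q_T = softmax(student_logits, temperature)   # softened student
    soft_term = temperature ** 2 * kl_divergence(p_T, q_T)
    q = softmax(student_logits, 1.0)             # unsoftened student
    hard_term = -math.log(q[target_index])       # cross-entropy vs. true token
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical logits over a three-token vocabulary; the true token is index 0.
loss = distillation_loss([4.0, 2.0, -1.0], [1.0, 1.5, 0.0], target_index=0)
print(f"combined loss: {loss:.4f}")
```

Two sanity checks follow from the formula: with \(\alpha = 1\) and a student whose logits equal the teacher's, the loss is zero; with \(\alpha = 0\) the function reduces to ordinary cross-entropy.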

Which direction of KL?

Here the generative case diverges sharply from the classification case. The forward KL, \(D_{KL}(p \,\|\, q_\theta)\), penalizes the student wherever the teacher places mass that the student misses. It is mode-covering: the student spreads its probability to avoid leaving any teacher mode uncovered. For a classifier that is fine. For a generator it is a problem, because a model that hedges across every continuation the teacher ever considered produces vague, averaged text and assigns probability to sequences the teacher would rarely actually sample.

Yuxian Gu and colleagues attacked this directly with MiniLLM by swapping the objective for reverse KL, \(D_{KL}(q_\theta \,\|\, p)\) (Gu et al., 2023, MiniLLM: Knowledge Distillation of Large Language Models, arXiv:2306.08543). Reverse KL is mode-seeking: it punishes the student for putting mass where the teacher does not, so the student concentrates on the teacher's dominant modes and stops hallucinating low-probability tail sequences. The practical payoff they report is text with higher overall quality, better calibration, and lower exposure bias, across student sizes from 120M to 13B parameters.

The contrast is worth holding onto:

Objective	Behavior	Effect on generation
Forward KL \(D_{KL}(p \,\|\, q_\theta)\)	Mode-covering	Student hedges; bland, averaged text
Reverse KL \(D_{KL}(q_\theta \,\|\, p)\)	Mode-seeking	Student commits to teacher's strong modes; sharper text
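
The asymmetry can be checked numerically. The sketch below scores two fixed candidate students against a bimodal teacher; all three distributions are hand-picked toy numbers, and this compares candidates rather than running any optimization.

```python
import math

def kl(a, b):
    """D_KL(a || b) for discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# Toy teacher with a tall mode (index 1) and a shorter mode (index 3).
teacher = [0.02, 0.60, 0.03, 0.33, 0.02]

# Candidate A spreads mass widely; candidate B locks onto the tall mode.
hedger = [0.12, 0.30, 0.16, 0.30, 0.12]
committer = [0.02, 0.90, 0.03, 0.03, 0.02]

forward = {"hedger": kl(teacher, hedger), "committer": kl(teacher, committer)}
reverse = {"hedger": kl(hedger, teacher), "committer": kl(committer, teacher)}

print("forward KL prefers:", min(forward, key=forward.get))
print("reverse KL prefers:", min(reverse, key=reverse.get))
```

On these numbers forward KL scores the hedger as closer to the teacher, because the committer leaves the second mode nearly uncovered, while reverse KL scores the committer as closer, because the hedger puts mass on tokens the teacher considers unlikely.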

The train-inference gap

A second problem is subtler. If you train the student only on text the teacher (or the dataset) produced, the student never practices recovering from its own mistakes. At inference it conditions on its own generated prefix, which drifts away from anything it saw in training. This is exposure bias, and it is why a model can look excellent on teacher-forced loss and still wander off when it free-runs.

On-policy distillation fixes this by sampling sequences from the student during training and having the teacher score them. Rishabh Agarwal and colleagues formalized this as Generalized Knowledge Distillation (Agarwal et al., 2023, GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models, arXiv:2306.13649). The student generates, the teacher grades the student's own tokens, and the loss pulls the student toward the teacher exactly on the distribution the student will actually face at inference. In the authors' experiments, GKD beat standard token-level distillation on summarization, translation, and arithmetic reasoning. MiniLLM's optimization is on-policy in the same spirit, deriving a policy-gradient style update for the reverse-KL objective.

flowchart TD
  Start[Pick distillation objective] --> Q1{Generation<br/>or classification?}
  Q1 -->|Classification| FK[Forward KL<br/>soft targets]
  Q1 -->|Generation| Q2{Train-inference<br/>gap a concern?}
  Q2 -->|No, offline corpus| SEQ[Sequence-level KD<br/>on teacher outputs]
  Q2 -->|Yes| ONP["On-policy distillation<br/>student samples, teacher grades"]
  ONP --> Q3{Hedging<br/>vs committing?}
  Q3 -->|Commit to modes| RK[Reverse KL]
  Q3 -->|Cover all modes| FK2[Forward KL on-policy]
  class Start,Q1,Q2,Q3 slate
  class FK,FK2,SEQ blue
  class ONP,RK emerald
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

[IMAGE: A two-panel diagram of forward vs reverse KL fitting a bimodal teacher distribution. Forward KL student is a wide single hump straddling both modes; reverse KL student is a narrow hump locked onto the taller mode. Label "mode-covering" and "mode-seeking."]

Seeing It in Motion

The static loss equation hides the loop that makes on-policy distillation tick. The student is both the thing being trained and the thing generating the training data, which means the data distribution moves as the student improves. The sequence below traces one training step.

sequenceDiagram
  participant D as Data prompt
  participant S as Student
  participant T as Teacher
  participant O as Optimizer
  D->>S: Provide prompt
  S->>S: Sample continuation on-policy
  S->>T: Send student tokens
  T->>T: Score each token, full distribution
  T->>O: Teacher distribution per position
  S->>O: Student distribution per position
  O->>O: Compute reverse-KL gradient
  O->>S: Update student weights
  Note over S,T: Repeat. Student drifts toward teacher
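
The step traced in the diagram can be sketched as a toy loop. Everything here is an assumption made for illustration: the "models" are bigram logit tables over a three-token vocabulary, the learning rate and sequence length are arbitrary, and the update uses the exact per-position reverse-KL gradient without differentiating through the sampling step. That is closer in spirit to GKD than to MiniLLM's policy-gradient derivation, and it is not either paper's implementation.

```python
import math
import random

VOCAB = 3  # toy vocabulary: token ids 0, 1, 2

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def on_policy_step(student, teacher, prompt_token, length, lr, rng):
    """Sample from the student, grade each position with the teacher, update."""
    context = prompt_token
    total_kl = 0.0
    for _ in range(length):
        q = softmax(student[context])   # student distribution at this position
        p = softmax(teacher[context])   # teacher scores the same position
        log_ratio = [math.log(q[k] / p[k]) for k in range(VOCAB)]
        kl = sum(q[k] * log_ratio[k] for k in range(VOCAB))  # D_KL(q || p)
        total_kl += kl
        # On-policy: the next context is the student's own sample.
        next_token = rng.choices(range(VOCAB), weights=q)[0]
        # Gradient of D_KL(q || p) w.r.t. student logit k is q_k * (log_ratio_k - KL).
        for k in range(VOCAB):
            student[context][k] -= lr * q[k] * (log_ratio[k] - kl)
        context = next_token
    return total_kl / length

# Row i holds the next-token logits given that the previous token was i.
teacher = [[2.0, 0.0, -1.0], [0.0, 2.0, -1.0], [-1.0, 0.0, 2.0]]
student = [[0.0] * VOCAB for _ in range(VOCAB)]  # starts uniform
rng = random.Random(0)

history = [on_policy_step(student, teacher, 0, 8, 0.5, rng) for _ in range(200)]
print(f"mean reverse KL, first step: {history[0]:.3f}, last step: {history[-1]:.3f}")
```

The point of the toy is the data flow, not the numbers: the contexts that receive gradient are exactly the ones the student visits when it free-runs, so the set of corrected positions shifts as the student changes.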

The architecture that wraps this loop in production has more moving parts than the math implies. A large teacher is expensive to run, so teams cache its logits or precompute soft targets where the data is static, and run the teacher live only where on-policy sampling demands it. Storing just the top-k teacher logits per position, rather than the full vocabulary distribution, is the usual compromise.
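
A minimal sketch of what a top-k cache entry might hold and how a loss could consume it. The function names and the choice to renormalize the teacher over the kept tokens are assumptions; other conventions exist, such as assigning the leftover mass to a single catch-all bucket.

```python
import math

def top_k_cache_entry(teacher_probs, k):
    """Keep the k most probable tokens as (token_id, prob) pairs, plus leftover mass."""
    ranked = sorted(enumerate(teacher_probs), key=lambda pair: pair[1], reverse=True)
    kept = ranked[:k]
    tail_mass = max(1.0 - sum(prob for _, prob in kept), 0.0)
    return kept, tail_mass

def truncated_forward_kl(entry, student_probs, eps=1e-12):
    """Forward KL over cached tokens only, with the teacher renormalized over them."""
    kept, _ = entry
    kept_mass = sum(prob for _, prob in kept)
    loss = 0.0
    for token_id, prob in kept:
        p = prob / kept_mass  # assumption: renormalize the truncated teacher
        loss += p * math.log(p / (student_probs[token_id] + eps))
    return loss

# Hypothetical five-token distributions for one position.
teacher_probs = [0.02, 0.62, 0.30, 0.04, 0.02]
student_probs = [0.10, 0.25, 0.20, 0.30, 0.15]

entry = top_k_cache_entry(teacher_probs, k=2)
print("cached tokens:", [token_id for token_id, _ in entry[0]])
print(f"discarded tail mass: {entry[1]:.2f}")
print(f"truncated loss: {truncated_forward_kl(entry, student_probs):.4f}")
```

The storage saving is the reason for the compromise: each position stores k id-probability pairs instead of one value per vocabulary entry, at the cost of discarding whatever signal sat in the tail.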

graph TD
  subgraph Offline
    C[Training corpus] --> TG[Teacher inference]
    TG --> LC[Top-k logit cache]
  end
  subgraph Online_loop
    LC --> DL[Distillation trainer]
    SP[Student samples] --> TR[Teacher re-scoring]
    TR --> DL
    DL --> SM[Student checkpoint]
    SM --> SP
  end
  SM --> EX[Export to 4-bit<br/>on-device build]
  class C,LC blue
  class TG,TR,DL purple
  class SM,EX teal
  class SP slate
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

[IMAGE: System architecture showing the teacher logit cache feeding a trainer, with a separate live re-scoring path for on-policy samples, ending in a quantization-and-export box. Annotate where compute and storage costs concentrate.]

By the Numbers

Distillation's value shows up in two places: the quality a small model reaches, and the compute arithmetic of getting there. The headline production results below come from the models' own technical reports.

Model	Params	Training signal	Reported result	Source
DistilBERT	~66M	Pretraining distillation	~97% of BERT GLUE, 60% faster	Sanh et al., 2019
Kim and Rush student	small NMT	Sequence-level KD	~10x faster, BLEU drop ~0.2	Kim and Rush, 2016
Phi-3-mini	3.8B	Curated and synthetic data	Rivals Mixtral 8x7B and GPT-3.5	Abdin et al., 2024
Phi-3-small	7B	Same recipe, scaled	~75% MMLU	Abdin et al., 2024
Phi-3-medium	14B	Same recipe, scaled	~78% MMLU	Abdin et al., 2024
Gemma 2	2B, 9B	Distillation pretraining	Competes with 2-3x larger models	Gemma Team, 2024

A few of these numbers reward a second look. Phi-3-mini's claim is that 3.8 billion parameters, trained on 3.3 trillion tokens of heavily filtered web data plus synthetic "textbook-style" data, can rival a model with roughly an order of magnitude more total parameters (Abdin et al., 2024, Phi-3 Technical Report, arXiv:2404.14219). The Phi line is not pure logit distillation; it is closer to data distillation, where a strong model authors the curriculum. That blurring of "distillation" into "synthetic data from a teacher" is one of the defining moves of the current era, and it traces back to the Phi-1 thesis that data quality dominates (Gunasekar et al., 2023, Textbooks Are All You Need, arXiv:2306.11644).

Gemma 2's number to sit with is the 50x. Training a 2B or 9B model on more than fifty times the compute-optimal token count would normally be a waste under Chinchilla-style reasoning, because the marginal token stops helping. Distillation changes that calculus: with a teacher distribution as the target instead of one-hot labels, each token carries more bits of signal, and the extra tokens keep paying off well past the point where next-token prediction would plateau.

[IMAGE: Grouped bar chart of MMLU score against active parameter count, plotting Phi-3-mini (3.8B) and Gemma 2 (9B) against larger models, with the distilled small models marked to show they sit above the trend line for their size.]

The 2025 distillation scaling laws put structure on this intuition. They give a predictive law that estimates a distilled student's loss from the compute split between training the teacher and training the student (Busbridge et al., 2025, Distillation Scaling Laws, arXiv:2502.08606). The actionable finding: if a capable teacher already exists, or if you plan to distill many students from it, distillation beats supervised pretraining up to a student-size-dependent compute threshold. If you must train the teacher from scratch to distill a single student, plain supervised pretraining is the better bet. Distillation is an amortization play; its economics depend on reuse.
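
The amortization argument reduces to simple accounting. The sketch below is not the scaling law itself, which predicts student loss from a compute split; it only illustrates why reuse flips the decision. All cost figures are hypothetical and in arbitrary compute units, and `pretrain_cost` stands for an assumed cost of reaching the same quality by supervised pretraining.

```python
import math

def distilled_cost_per_student(teacher_cost, distill_cost, num_students,
                               teacher_exists=False):
    """Compute charged to each student when the teacher's cost is shared."""
    if num_students < 1:
        raise ValueError("num_students must be at least 1")
    shared_teacher = 0.0 if teacher_exists else teacher_cost / num_students
    return shared_teacher + distill_cost

def break_even_students(teacher_cost, distill_cost, pretrain_cost):
    """Smallest student count at which distilling costs no more than pretraining each."""
    saving = pretrain_cost - distill_cost
    if saving <= 0:
        return None  # distillation never pays off under these assumptions
    return math.ceil(teacher_cost / saving)

# Hypothetical units: teacher = 100, each distillation run = 10,
# pretraining a comparable student from scratch = 25.
print(distilled_cost_per_student(100.0, 10.0, num_students=1))   # single student
print(distilled_cost_per_student(100.0, 10.0, num_students=10))  # family of ten
print(break_even_students(100.0, 10.0, 25.0))
```

Under these made-up figures a single student bears the full teacher cost and loses to plain pretraining, while a family of ten comes in under the pretraining cost per model. If the teacher already exists, only the distillation run is charged.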

A Concrete Example

Walk one training position through on-policy reverse-KL distillation with a toy vocabulary of five tokens: {the, bank, river, money, airplane}. The prompt is "She sat on the grassy slope beside the". The student samples its own next token, and the teacher scores the full distribution.

Suppose the teacher (a strong model) produces this distribution at \(T = 1\):

Token	Teacher \(p\)	Student \(q_\theta\) (start)
the	0.02	0.10
bank	0.62	0.25
river	0.30	0.20
money	0.04	0.30
airplane	0.02	0.15

The student has badly misjudged the context. It thinks money is the most likely continuation, a financial-bank reading, when the slope and grass make a river-bank reading far stronger. The teacher knows bank and river dominate.

Forward KL, \(D_{KL}(p \,\|\, q_\theta)\), would push the student to put mass on everything the teacher likes, including a little on the and money, smearing the update. Reverse KL, \(D_{KL}(q_\theta \,\|\, p)\), does the opposite. The term \(\sum_i q_i \log(q_i / p_i)\) is dominated by tokens where the student has high probability but the teacher has low: money (0.30 vs 0.04) and airplane (0.15 vs 0.02). The gradient hammers those two down hard, and pulls the freed mass toward bank and river where the teacher is confident.

After one update the student might look like:

Token	Teacher \(p\)	Student \(q_\theta\) (after)
the	0.02	0.04
bank	0.62	0.45
river	0.30	0.33
money	0.04	0.10
airplane	0.02	0.08

The student has moved decisively toward the teacher's modes without trying to perfectly clone the long tail. Because the sampling was on-policy, the position being corrected is one the student actually generated, so the correction lands exactly where the student's free-running behavior needed it. Repeat across billions of such positions and the student internalizes the teacher's judgment about which continuations the context supports.
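
The per-token arithmetic behind this example can be reproduced directly from the two tables. The sketch below only evaluates the reverse-KL sum before and after; it does not derive the "after" column, which is an illustrative guess in the text rather than the result of an actual gradient step.

```python
import math

tokens = ["the", "bank", "river", "money", "airplane"]
teacher = [0.02, 0.62, 0.30, 0.04, 0.02]
student_start = [0.10, 0.25, 0.20, 0.30, 0.15]
student_after = [0.04, 0.45, 0.33, 0.10, 0.08]

def reverse_kl_terms(q, p):
    """Per-token terms q_i * log(q_i / p_i) of D_KL(q || p)."""
    return [qi * math.log(qi / pi) for qi, pi in zip(q, p)]

terms_start = reverse_kl_terms(student_start, teacher)
kl_start = sum(terms_start)
kl_after = sum(reverse_kl_terms(student_after, teacher))

# Rank tokens by how much each contributes to the starting reverse KL.
ranked = sorted(zip(tokens, terms_start), key=lambda pair: pair[1], reverse=True)
for token, term in ranked:
    print(f"{token:>9s}  {term:+.3f}")
print(f"reverse KL before: {kl_start:.3f}, after: {kl_after:.3f}")
```

The two largest positive terms belong to money and airplane, as the text claims, and the total reverse KL falls from roughly 0.76 to roughly 0.12 between the two tables. The terms for bank and river are negative because the student under-weights them relative to the teacher.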

[IMAGE: Before-and-after bar pair for the five-token vocabulary, with downward arrows on money and airplane and upward arrows on bank and river, visualizing the reverse-KL gradient redistributing mass toward the teacher's modes.]

Where It Breaks

Distillation is not a universal solvent, and the failure modes are specific.

The capacity gap is the first. A teacher that is too strong relative to the student can hurt. Its distribution encodes distinctions the student lacks the parameters to represent, so the student spends capacity chasing structure it cannot hold and ends up worse than a student distilled from a more moderate teacher. This is counterintuitive and well documented across the distillation literature; the best teacher for a given student is often not the strongest available model but one a controlled step above it.

Reverse KL's mode-seeking is a double-edged tool. By design it discards the teacher's tail. For most generation that improves quality, but if the task genuinely needs diversity (creative writing, exploration, calibrated uncertainty estimates), a mode-collapsed student is a liability. It will be fluent and confident and narrow.

On-policy distillation is expensive in a way the equation hides. Running the teacher to score the student's fresh samples at every step means the teacher participates in training, not just in a one-time data-generation pass. For a teacher with hundreds of billions of parameters, that inference cost can dominate the training budget, which is part of why teams cache logits and re-score selectively rather than running the teacher live everywhere.

Inheritance is the quiet failure. A student learns the teacher's biases, factual errors, and safety gaps along with its competence; distillation copies the whole distribution, including the parts you would rather not propagate. A distilled model also inherits the teacher's training-data cutoff and blind spots, and no amount of student-side fine-tuning fully launders them.

Finally, the synthetic-data variant of distillation risks a feedback loop. When teachers generate the curriculum for students that may themselves become future teachers, errors and stylistic tics can compound across generations. The research community has flagged model collapse from training on model-generated data as a real concern, though the severity in practice depends heavily on how much genuine human data remains in the mix.

[IMAGE: A line chart of student accuracy vs teacher size for a fixed student, showing a peak then a decline, illustrating the capacity gap where an over-strong teacher hurts the student.]

Alternative Designs

Distillation is one tool in the compression and small-model toolbox, and it composes with the others more often than it competes.

Approach	Strengths	Weaknesses	Best when
Knowledge distillation	Transfers behavior and judgment, not just weights	Needs a teacher; can copy teacher flaws	A strong teacher exists and will be reused
Quantization	Cheap, post-hoc, no retraining	Limited compression; precision floor	Shrinking an already-trained model for deployment
Pruning	Removes redundant weights structurally	Aggressive pruning needs fine-tuning to recover	Latency-bound serving on existing hardware
Pretrain small from scratch	No teacher dependency; clean lineage	Wastes the signal a teacher could provide	No suitable teacher, or single-student budget
Synthetic-data training (Phi style)	Controls the curriculum directly	Hard to verify data quality at scale	You can author or curate a high-quality corpus

The pairing that ships is distillation plus quantization. A distilled student that already reaches strong quality at a small parameter count is then quantized to 4-bit weights to fit on-device, which is exactly the export step in the architecture diagram above. Pruning enters when serving latency, not parameter count, is the binding constraint. And the scaling-law analysis reframes the "distill vs pretrain" choice not as a rivalry but as a budget decision: the answer flips depending on whether the teacher's cost is amortized across many students.

How It Is Used in Practice

The clearest production signal is that frontier labs now ship distilled small models as deliberate products, not afterthoughts. Google's Gemma 2 at 2B and 9B used distillation pretraining as its central training decision, and Google has positioned the small Gemma variants explicitly for on-device and cost-sensitive deployment. Microsoft's Phi-3-mini was engineered around the claim in its own subtitle, a highly capable model running locally on a phone, and the report documents a quantized 4-bit build occupying roughly 1.8 GB and running offline on a modern handset.
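
The 1.8 GB figure is consistent with back-of-envelope arithmetic on weight storage alone. The sketch below assumes every parameter is stored at the stated bit width and ignores quantization scales, any tensors kept at higher precision, and runtime memory such as activations and the KV cache.

```python
def weight_footprint_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights alone at a uniform bit width."""
    return num_params * bits_per_weight / 8

GB = 1e9        # decimal gigabyte
GIB = 2 ** 30   # binary gibibyte

phi3_mini_params = 3.8e9
for bits in (16, 8, 4):
    size = weight_footprint_bytes(phi3_mini_params, bits)
    print(f"{bits:>2d}-bit: {size / GB:.2f} GB ({size / GIB:.2f} GiB)")
```

At 4 bits this gives about 1.9 GB, or about 1.77 GiB, which is in the same range as the figure the report cites. The same model at 16 bits would need four times as much.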

The pattern beneath these is a teacher-student supply chain. A lab trains one expensive flagship, then distills a family of smaller models from it for different deployment points: a mid-size model for cost-sensitive API traffic, a small model for on-device, a tiny model for latency-critical features. The flagship's training cost is amortized across the whole family, which is precisely the regime the distillation scaling laws identify as favorable.

There is also a fast-growing gray area where "distillation" means an API consumer training a student on a larger model's generated outputs. This is technically sequence-level distillation, and it is potent enough that several model providers' terms of service restrict using their outputs to train competing models. The technique that started as Hinton compressing an ensemble has become a competitive and legal flashpoint, which is its own kind of evidence that it works.

[IMAGE: A supply-chain diagram showing one flagship teacher branching into a mid-size, small, and tiny distilled student, each tagged with a deployment target (API, laptop, phone).]

Insights Worth Remembering

A one-hot label is a lossy summary of the truth; a teacher's distribution is an estimate of the truth with its uncertainty intact, and that uncertainty is the part worth stealing.

The direction of the KL divergence is not a technicality. Forward KL builds a diplomat that agrees with everyone; reverse KL builds a specialist that commits. The choice is a decision about what kind of generator you want.

On-policy distillation works because it corrects the student where the student actually is, not where the teacher wishes it were. Most of distillation's hard-won progress is really about closing the gap between training distribution and inference distribution.

Distillation's economics are amortization economics. One teacher distilled into many students is a bargain; one teacher trained to distill one student is usually a mistake. The scaling laws made this quantitative rather than folkloric.

The best teacher is not always the biggest. A capacity gap means an over-strong teacher can teach distinctions the student cannot hold, and the student suffers for the ambition.

Data distillation and logit distillation are converging. When a teacher authors a student's curriculum, the line between "distilling a model" and "generating training data" disappears, and the Phi results suggest the curriculum may matter more than the gradient.

Open Questions

Whether reverse KL is the right objective in general remains unsettled. It has been measured to improve generation quality when the student should commit to the teacher's dominant modes, but the field is still working out how to get the calibration and diversity benefits of mode-covering without the blandness, and several hybrid divergences are under active study. Treat "reverse KL is simply better" as an oversimplification rather than a settled result.

The capacity-gap phenomenon is observed repeatedly but not fully explained. We can describe when an over-strong teacher hurts; a predictive theory of the optimal teacher-student size ratio for a given task is still open, and the distillation scaling laws are an early and partial step toward one.

Model collapse from recursive distillation is a genuine risk whose real-world severity is unresolved. Whether iterated teacher-to-student-to-teacher pipelines degrade meaningfully depends on the fraction of authentic human data retained, and the empirical evidence so far is mixed rather than alarming. This is a place to separate the demonstrated mechanism (collapse can happen in controlled setups) from the speculative leap (it will doom real pipelines).

The legal and economic status of output-based distillation is unsettled in a way that may shape the field as much as any technical result. If training students on a competitor's outputs is both effective and contractually forbidden, the question of how that line is drawn and enforced is likely to influence which distillation methods labs are willing to publish, let alone ship.