Knowledge Distillation from a Teacher Model
Knowledge distillation trains a smaller student model to reproduce a larger teacher's output distribution, enabling compact models with performance well beyond what their size alone would predict.
A 13-billion-parameter model trained on GPT-4's explanations can beat LLaMA-65B on academic reasoning benchmarks. That is not a fluke of benchmark overfitting; it is the central promise of knowledge distillation carried into the LLM era. The gap between what a model "knows" and what raw supervised learning on human labels can teach it is bridged by using a stronger model as an oracle.
The core mechanism: soft labels carry more signal
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean formalised the idea in 2015. The insight was simple but profound: a neural network's output distribution over all classes (not just the argmax label) is itself a form of compressed knowledge. If a classifier outputs 0.7 for "cat", 0.2 for "lynx", and 0.1 for "dog", that distribution encodes structural relationships the one-hot label "cat" silently discards.
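One rough way to see the extra signal is to compare the Shannon entropy of the soft label above with that of its one-hot counterpart. The sketch below does this for the cat/lynx/dog example; the helper name `entropy_bits` is illustrative, and entropy is only a proxy for "information the student can learn from", not a measure used in the original paper.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Class order: cat, lynx, dog (the example distribution from the text).
soft_label = [0.7, 0.2, 0.1]
hard_label = [1.0, 0.0, 0.0]

soft_bits = entropy_bits(soft_label)
hard_bits = entropy_bits(hard_label)

print(f"soft label entropy: {soft_bits:.3f} bits")
print(f"hard label entropy: {hard_bits:.3f} bits")
```

The one-hot label has zero entropy: it names the class and nothing else. The soft label additionally ranks the wrong answers, telling the student that "lynx" is a more plausible confusion than "dog".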
The student is trained to minimise a combined loss:
L_total = α · L_CE(y_hard, y_student) + (1 - α) · L_KL(y_soft_T, y_student_T)
where:
- y_hard is the one-hot ground-truth label
- y_soft_T is the teacher's softmax output at temperature T
- y_student_T is the student's softmax output at the same temperature
- α balances the two objectives (commonly 0.1-0.5)
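Temperature T > 1 "softens" both distributions, moving probability mass from the dominant class toward entries that would otherwise sit near zero and giving the student a clearer gradient signal about which wrong answers the teacher considers plausible. Because the gradients from the soft targets shrink roughly as 1/T², Hinton et al. recommend multiplying the KL term by T² when it is combined with the hard-label loss.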
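The combined loss can be sketched in plain Python for a single example. This is a minimal illustration, not a training implementation: the logits are made-up values, the function names are illustrative, and `scale_by_t_squared` is an optional flag that applies the T² factor suggested by Hinton et al. (the formula above does not include it). The hard-label term uses the student's ordinary T = 1 output.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_index, probs):
    """L_CE against a one-hot label."""
    return -math.log(probs[target_index])

def kl_divergence(p, q):
    """KL(p || q), with p playing the role of the teacher distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, target_index,
                      alpha=0.3, temperature=2.0, scale_by_t_squared=True):
    hard = cross_entropy(target_index, softmax(student_logits))
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    if scale_by_t_squared:
        # Assumption: optional T^2 factor, not part of the formula in the text.
        soft *= temperature ** 2
    return alpha * hard + (1 - alpha) * soft

# Made-up logits for the classes cat, lynx, dog.
teacher_logits = [4.0, 2.5, 1.0]
student_logits = [3.0, 0.5, 1.5]

# Higher temperature flattens the teacher's distribution.
for t in (1.0, 2.0, 4.0):
    print(t, [round(p, 3) for p in softmax(teacher_logits, t)])

print(distillation_loss(teacher_logits, student_logits, target_index=0))
```

In a real training loop the same computation would run on batched tensors with automatic differentiation; the structure of the loss is unchanged.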
For LLMs the picture is the same at every token position. The teacher produces a distribution over the vocabulary; the student learns to match it via KL divergence rather than merely predicting the top token by cross-entropy on human text.
Three flavours of teacher-student transfer for LLMs
The mechanism is constant; what varies is what exactly you transfer from teacher to student.
| Transfer style | What the student sees | Representative work |
|---|---|---|
| Output tokens only | Decoded text from the teacher | Alpaca (Stanford, 2023) |
| Full soft distribution | Per-token logits or probabilities | DistilBERT (Hugging Face, 2019) |
| Reasoning traces + explanations | Chain-of-thought + final answer | Orca (Microsoft, 2023) |
Output-token distillation is the simplest: run the teacher on a set of prompts, collect its decoded responses, fine-tune the student on those (prompt, response) pairs as if they were human demonstrations. The student never sees the token-level probability surface; it only sees one sample per prompt. This is cheap and widely used.
Soft-distribution distillation is richer. At each decoding step the teacher emits a full vocabulary distribution. The student is trained to minimise KL divergence against that distribution rather than cross-entropy against the single sampled token. This is expensive (the teacher must be run in tandem, and logits must be stored or streamed), but more data-efficient per token because the training target at each position covers every entry in the vocabulary, not just the chosen one.
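The difference between the two objectives shows up in the gradient each one sends to the student's logits. For a softmax output at T = 1, the gradient of cross-entropy is `student - one_hot`, while the gradient of KL(teacher || student) is `student - teacher`. The sketch below uses a made-up four-token vocabulary and two positions; all distributions and function names are illustrative.

```python
import math

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sequence_kl(teacher_dists, student_dists):
    """Mean per-position KL(teacher || student) over a sequence."""
    per_token = [kl_divergence(p, q) for p, q in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)

def sequence_ce(sampled_ids, student_dists):
    """Mean negative log-likelihood of the teacher's sampled tokens."""
    nll = [-math.log(q[tok]) for tok, q in zip(sampled_ids, student_dists)]
    return sum(nll) / len(nll)

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def logit_gradient(target, student):
    """Gradient of CE or KL w.r.t. the student's logits: student - target."""
    return [q - t for q, t in zip(student, target)]

# Toy data: 2 positions, vocabulary of 4 tokens.
teacher_dists = [[0.6, 0.3, 0.08, 0.02], [0.1, 0.1, 0.7, 0.1]]
student_dists = [[0.4, 0.2, 0.2, 0.2], [0.25, 0.25, 0.25, 0.25]]
sampled_ids = [0, 2]  # tokens the teacher happened to emit

print("token-level CE:", sequence_ce(sampled_ids, student_dists))
print("soft KL:       ", sequence_kl(teacher_dists, student_dists))

# Gradients at position 0 under each objective.
hard_grad = logit_gradient(one_hot(sampled_ids[0], 4), student_dists[0])
soft_grad = logit_gradient(teacher_dists[0], student_dists[0])
print("hard-target gradient:", [round(g, 2) for g in hard_grad])
print("soft-target gradient:", [round(g, 2) for g in soft_grad])
```

With gradient descent, a positive entry pushes that token's logit down. The hard target pushes token 1 down just as firmly as tokens 2 and 3, whereas the soft target pulls token 1 up, because the teacher considered it a plausible alternative.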
Reasoning-trace distillation goes further still. Rather than distilling just the final answer, the teacher generates step-by-step explanations, self-critiques, or explicit chain-of-thought. Orca (Mukherjee et al., 2023) trained a 13B model on roughly 5 million explanation traces collected from ChatGPT plus about 1 million from GPT-4, and showed it could match or exceed LLaMA-65B on a suite of reasoning benchmarks. The student learns not just to produce the same output but to internalise the problem-decomposition strategy. Orca 2 (2023) extended this by teaching the model to select the right strategy (step-by-step vs. direct recall vs. code execution) per task type.
Why the student can exceed what supervised learning alone would achieve
This seems paradoxical: teacher and student are typically pretrained on much the same kind of web-scale text. If the student has already seen comparable data, what can it learn from the teacher that it could not learn from that data directly?
The answer is that base pretraining optimises a language modelling objective over enormous amounts of noisy, unstructured text. The teacher has implicitly organised latent representations and output distributions that make fine-grained distinctions much more legible than any human annotation scheme could encode. When the student fits the teacher's distribution, it is fitting a compressed, structured view of that knowledge rather than the noisy raw signal. The teacher has done the hard work of sifting pretraining noise into a coherent signal; the student merely needs to absorb it.
Put differently: the teacher is a better teacher than most human annotators for technical tasks because it has seen vastly more examples and has no annotation fatigue or inconsistency. Its soft label for "what comes after a partial differential equation derivation" is far more calibrated than a crowd-sourced label.
Practical recipe: distilling a student in the LLM era
A minimal working recipe:
- Choose your teacher and budget. A larger teacher-student gap means a larger potential gain, but also higher inference cost during data generation. A common pairing is GPT-4 or Claude 3.5 Sonnet as teacher and a 7B-13B open-weight model as student.
- Generate the training set. Construct a diverse prompt distribution (instructions, few-shot problems, domain-specific tasks). Run the teacher at temperature 0 or a low temperature for consistency. Collect the full chain-of-thought if doing trace distillation.
- Apply rejection sampling if quality is uneven. Generate k responses per prompt from the teacher (this requires a non-zero sampling temperature, otherwise the k responses will be near-identical) and keep only those that pass a verifier: unit tests for code, a strong judge model for open-ended answers, exact match for maths. This filters noise before the student ever sees the data.
- Fine-tune the student. Standard supervised fine-tuning with cross-entropy on teacher tokens works. If you have access to the teacher's logits, add a KL divergence term. A reasonable starting point is a learning rate around 1e-5, 2-3 epochs, and a cosine schedule. Watch for overfitting on the distillation set.
- Evaluate on held-out benchmarks the teacher was not prompted for. This is the critical check: does capability transfer, or has the student merely memorised teacher phrasing?
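The generation and rejection-sampling steps can be sketched as follows. Everything here is a stand-in: `toy_teacher` is a hypothetical function that simulates a noisy teacher on addition prompts (it is not a real API client), and `exact_match_verifier` plays the role of the maths verifier. A real pipeline would swap in an actual model call and a task-appropriate checker.

```python
import random

def toy_teacher(prompt, rng):
    """Hypothetical teacher: answers 'a+b' prompts, wrong about 30% of the time."""
    a, b = (int(x) for x in prompt.split("+"))
    answer = a + b
    if rng.random() < 0.3:
        answer += rng.choice([-1, 1])  # inject an off-by-one error
    return f"{a} + {b} = {answer}"

def exact_match_verifier(prompt, response):
    """Accept a response only if its final number equals the true sum."""
    a, b = (int(x) for x in prompt.split("+"))
    return response.rsplit("=", 1)[-1].strip() == str(a + b)

def build_distillation_set(prompts, teacher, verifier, k=4, seed=0):
    """Sample k candidates per prompt and keep the verified, de-duplicated ones."""
    rng = random.Random(seed)
    dataset = []
    for prompt in prompts:
        candidates = [teacher(prompt, rng) for _ in range(k)]
        accepted = [c for c in candidates if verifier(prompt, c)]
        unique = list(dict.fromkeys(accepted))  # drop duplicates, keep order
        dataset.extend({"prompt": prompt, "response": r} for r in unique)
    return dataset

prompts = ["2+3", "10+7", "41+1"]
dataset = build_distillation_set(prompts, toy_teacher, exact_match_verifier)
for row in dataset:
    print(row)
```

The resulting (prompt, response) pairs are what the fine-tuning step consumes. A prompt for which all k samples fail the verifier simply contributes no rows, which is worth monitoring: a high rejection rate on some prompt family means the student will see little data for it.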
When it falls down
Style imitation masking capability gaps. The most dangerous failure mode. Models trained purely on teacher outputs often sound like the teacher while lacking its depth. Human raters consistently rate such "imitation models" as competitive with the teacher; automated evaluations expose the gap on novel factual and reasoning tasks. Gudibande et al. (2023) formalised this as "The False Promise of Imitating Proprietary LLMs." The student learns discourse style (how the teacher writes) far more readily than factual correctness or deep reasoning (what the teacher knows).
Distribution mismatch collapse. The teacher's output distribution was not calibrated for the student's architecture, capacity, or pretraining distribution. The student fits the teacher's soft labels on the training prompts but generalises poorly because the teacher's probability surface is too complex for a much smaller model to approximate accurately across the full input space.
Data homogenisation leading to model collapse. If successive generations of models are distilled from each other rather than from humans or diverse data, the training distribution narrows. Variance in the teacher's outputs is systematically lost with each generation. Alemohammad et al. (2023) showed that self-consuming generative models degrade in diversity ("go MAD") without continual injection of fresh real-world data. This is the recursive distillation failure mode.
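A toy model makes the narrowing visible. The sketch below assumes each generation's student perfectly fits the distribution its teacher produces when sampled at a temperature below 1, which is the same as raising the probabilities to the power 1/T and renormalising. This is a deliberately simplified illustration, not the experimental setup of Alemohammad et al.; the function names and the `fresh_data_mix` parameter are illustrative.

```python
import math

def entropy_bits(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def sharpen(probs, temperature):
    """Distribution obtained by sampling from `probs` at the given temperature."""
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

def recursive_distillation(initial, generations, temperature=0.8, fresh_data_mix=0.0):
    """Each generation fits the previous one's sharpened output,
    optionally mixed with a fraction of the original ("real") distribution."""
    history = [initial]
    current = initial
    for _ in range(generations):
        synthetic = sharpen(current, temperature)
        current = [(1 - fresh_data_mix) * s + fresh_data_mix * r
                   for s, r in zip(synthetic, initial)]
        history.append(current)
    return history

real = [0.4, 0.3, 0.2, 0.1]
closed_loop = recursive_distillation(real, generations=10)
with_fresh = recursive_distillation(real, generations=10, fresh_data_mix=0.3)

closed_entropy = [entropy_bits(p) for p in closed_loop]
fresh_entropy = [entropy_bits(p) for p in with_fresh]

for gen, (c, f) in enumerate(zip(closed_entropy, fresh_entropy)):
    print(f"gen {gen:2d}  closed loop: {c:.3f} bits   with fresh data: {f:.3f} bits")
```

In the closed loop the entropy falls at every generation as mass piles onto the most likely outcome. Mixing in a share of the original distribution each round holds the entropy well above the closed-loop value, mirroring the paper's point about fresh real data.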
ToS and licence violations. Generating training data from closed commercial APIs (GPT-4, Claude) and using it to train a competing model typically violates the provider's terms of service. This matters practically: several distillation-trained open-source releases have been taken down. Check the terms before building a production pipeline.
Capacity mismatch and gradient saturation. If the student is far too small (e.g., a 1B model distilling from GPT-4), soft label entropy is high and the loss surface is dominated by probability mass the student simply cannot represent. Training becomes unstable or converges to a degenerate mode. A rough rule of thumb: the student should be no more than one order of magnitude smaller than the teacher in terms of effective parameter count; beyond that, the signal becomes noise.
Compounding errors in chain-of-thought traces. Even a very capable teacher makes errors in long reasoning chains. When those errors are absorbed into the student as ground truth, the student learns to confidently reproduce the teacher's failure modes. Filtering by final-answer correctness helps but does not catch cases where the reasoning is wrong and the answer coincidentally correct.
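The blind spot of answer-only filtering is easy to reproduce. In the sketch below a trace is a list of arithmetic steps plus a final answer; the trace format, the function names, and the step checker are all hypothetical, and the checker only validates arithmetic, which is far weaker than a real process verifier.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def answer_is_correct(trace, gold_answer):
    """Outcome filter: looks only at the final answer."""
    return trace["answer"] == gold_answer

def faulty_steps(trace):
    """Process check: indices of steps whose claimed result is arithmetically wrong."""
    return [i for i, (a, op, b, claimed) in enumerate(trace["steps"])
            if OPS[op](a, b) != claimed]

# Task: compute 5 + 5 - 4 (gold answer 6).
gold = 6
good_trace = {"steps": [(5, "+", 5, 10), (10, "-", 4, 6)], "answer": 6}
# Two errors that cancel out: 5 + 5 is not 11, and 11 - 4 is not 6.
lucky_trace = {"steps": [(5, "+", 5, 11), (11, "-", 4, 6)], "answer": 6}

for name, trace in (("good", good_trace), ("lucky", lucky_trace)):
    print(name,
          "| passes answer filter:", answer_is_correct(trace, gold),
          "| faulty steps:", faulty_steps(trace))
```

Both traces pass the answer filter, but only the step-level check flags the second one. A student trained on it would be rewarded for reproducing two arithmetic mistakes.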
Further reading
- Hinton, Vinyals, Dean (2015) - "Distilling the Knowledge in a Neural Network": https://arxiv.org/abs/1503.02531
- Mukherjee et al. (2023) - "Orca: Progressive Learning from Complex Explanation Traces of GPT-4": https://arxiv.org/abs/2306.02707
- Gudibande et al. (2023) - "The False Promise of Imitating Proprietary LLMs": https://arxiv.org/abs/2305.15717
- Alemohammad et al. (2023) - "Self-Consuming Generative Models Go MAD": https://arxiv.org/abs/2307.01850