Foundations
Few-Shot and In-Context Learning
Learning a task from a handful of worked examples placed in the prompt, with no weight updates, and the surprising evidence about what those examples actually teach.
intermediate · 8 min read
Give a model three worked examples of turning a sentence into structured JSON, then a fourth sentence, and it produces the JSON. No gradients ran. No parameters changed. The task was specified entirely by text sitting in the prompt, and the specification evaporates the moment the context window is cleared. This is in-context learning (ICL): the model conditions its next-token prediction on demonstrations you place before the query, and behaves as if it had been trained for the task. GPT-3 is what turned this from a curiosity into a default way of programming language models.
The GPT-3 result that made ICL matter
Brown et al. (2020) trained a 175-billion-parameter autoregressive model and showed that "scaling up language models greatly improves task-agnostic, few-shot performance," sometimes reaching results competitive with fine-tuned baselines without a single gradient update at test time. The paper drew a three-way distinction that is still the working vocabulary:
- Zero-shot: only a natural-language task description, no examples.
- One-shot: the description plus a single demonstration.
- Few-shot: the description plus K demonstrations (typically 10 to 100, bounded by the context window).
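The three settings differ only in how many demonstrations sit between the task description and the query. A minimal sketch of that assembly follows; the `Input:`/`Output:` template, the function name `build_prompt`, and the translation examples are illustrative assumptions, not the exact format used in the GPT-3 paper.

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble a zero-, one-, or few-shot prompt from K demonstrations."""
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The query reuses the demonstration template with the output left open.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical demonstrations, chosen only for illustration.
demos = [("cheese", "fromage"), ("house", "maison"), ("book", "livre")]
description = "Translate English to French."

zero_shot = build_prompt(description, [], "cat")        # K = 0
one_shot = build_prompt(description, demos[:1], "cat")  # K = 1
few_shot = build_prompt(description, demos, "cat")      # K = 3
```

The model weights are identical in all three cases; only the text before the final `Output:` changes.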
The headline was not that a model could do translation or arithmetic; smaller fine-tuned models already could. It was that one frozen set of weights could be steered into hundreds of tasks purely through the prompt, and that the few-shot gap over zero-shot widened as the model grew. Capability was latent in the pretrained weights; demonstrations were the addressing scheme that reached it.
What demonstrations actually teach (the counter-intuitive part)
The natural assumption is that few-shot examples work by showing the model correct input-output mappings, the way a supervised training set does. Min et al. (2022), "Rethinking the Role of Demonstrations," took that assumption apart. Their central result: randomly replacing the labels in the demonstrations barely hurts performance on a range of classification and multiple-choice tasks. A model shown examples with the wrong answers attached still classified the query nearly as well as one shown the right answers.
State that finding carefully, because it is easy to overstate. It does not mean labels never matter or that ICL is a parlour trick. It means the demonstrations are doing something other than teaching the mapping. Min et al. isolate what does carry the signal:
- The label space. Seeing the set of possible outputs (positive/negative, the class names, the answer format) tells the model which of its latent behaviours to activate.
- The input distribution. Examples drawn from the right kind of text prime the model on the domain.
- The format. The structural template, how input maps to output on the page, is what the model imitates.
So few-shot prompting is closer to task location than to task learning. You are pointing the model at a capability it already has and fixing the output shape, not fitting a function from the examples. The practical upshot: get the format, the class labels, and the example style right, and do not assume that hand-verifying every demonstration label buys you as much as intuition suggests. How much correctness matters is genuinely task-dependent (later work found stronger label effects on harder or more novel tasks), so treat "labels do not matter" as a caution against a false assumption, not a licence to be sloppy.
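To make the random-label manipulation concrete, here is a small sketch of how such a control prompt could be built. It keeps the inputs, the label space, and the format fixed and resamples only the labels. The helper names, the reviews, and the `Review:`/`Sentiment:` template are assumptions for illustration, not Min et al.'s code.

```python
import random


def corrupt_labels(demonstrations, label_space, rng):
    """Replace each gold label with one drawn uniformly from the label space."""
    return [(text, rng.choice(label_space)) for text, _ in demonstrations]


def render(demonstrations, query):
    """Render demonstrations and the open query in one fixed template."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demonstrations]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


# Hypothetical demonstrations with gold labels.
gold = [
    ("The film was a delight.", "positive"),
    ("A tedious, joyless slog.", "negative"),
    ("I would watch it again.", "positive"),
    ("The plot made no sense.", "negative"),
]
label_space = ["positive", "negative"]

randomized = corrupt_labels(gold, label_space, random.Random(0))
gold_prompt = render(gold, "Surprisingly moving.")
random_prompt = render(randomized, "Surprisingly moving.")
```

Comparing accuracy under `gold_prompt` and `random_prompt` on a labelled evaluation set is the kind of measurement that tells you how much label correctness matters for your task. Because sampling is uniform, some random labels will coincide with the gold ones.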
Order, recency, and majority bias
If demonstrations locate a behaviour, their arrangement should be a minor detail. It is not. Lu et al. (2021), "Fantastically Ordered Prompts and Where to Find Them," showed that the order of the same set of examples "can make the difference between near state-of-the-art and random guess performance." Same examples, same model, different permutation, wildly different accuracy. The sensitivity does not reliably shrink with model size, which rules out "just use a bigger model" as the fix.
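One way to act on this sensitivity without labelled data is to score every permutation by how evenly its predictions spread over the label space on a set of probe inputs, and prefer orderings that do not collapse onto one class. The sketch below is a simplified take on that entropy idea, not a reproduction of Lu et al.'s procedure. `toy_predict` is a hypothetical stand-in for a real model call, deliberately written to favour the most recent labels, and the probe inputs are invented.

```python
import itertools
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in nats) of the empirical label distribution."""
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in Counter(labels).values())


def rank_orderings(demonstrations, probe_inputs, predict):
    """Score every permutation by prediction entropy, highest first."""
    scored = []
    for ordering in itertools.permutations(demonstrations):
        predictions = [predict(ordering, x) for x in probe_inputs]
        scored.append((label_entropy(predictions), ordering))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def toy_predict(ordering, text):
    """Hypothetical model stub: clear cues win, otherwise recent labels dominate."""
    if "great" in text:
        return "positive"
    if "awful" in text:
        return "negative"
    last_two = {label for _, label in ordering[-2:]}
    if len(last_two) == 1:
        return last_two.pop()  # both final demonstrations agree: copy them
    return "positive" if len(text) % 2 == 0 else "negative"  # arbitrary tie-break


demos = [("loved it", "positive"), ("a triumph", "positive"),
         ("hated it", "negative"), ("a mess", "negative")]
probes = ["great cast", "awful pacing", "it was fine", "not sure yet"]

ranked = rank_orderings(demos, probes, toy_predict)
best_entropy, best_ordering = ranked[0]
```

Exhaustive search is only feasible for small K, since the number of permutations grows factorially. With more demonstrations you would sample orderings instead.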
Two biases drive much of this. Recency bias: labels appearing near the end of the prompt are over-weighted, so the last demonstration's class gets predicted too often. Majority-label bias: if the demonstrations are class-imbalanced, the model's predictions skew toward the majority class regardless of the query. Together they mean a few-shot prompt is a biased estimator whose bias you can partly design away: balance the classes, and either search for a good ordering (Lu et al. rank orderings using an entropy statistic on a synthetic dev set, no extra labels needed) or calibrate the output probabilities against a content-free input.
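Calibration against a content-free input can be sketched in a few lines. The idea is to ask the model for label probabilities on a placeholder query such as "N/A", treat that as the prompt's built-in bias, and divide it out. This is a simplified form (divide, then renormalise), and the probabilities below are invented for illustration rather than taken from any model.

```python
def calibrate(probs, content_free_probs):
    """Divide out the prompt's bias and renormalise to a distribution."""
    scaled = {label: probs[label] / content_free_probs[label] for label in probs}
    total = sum(scaled.values())
    return {label: value / total for label, value in scaled.items()}


# Hypothetical model outputs. With a content-free query the prompt alone
# already pushes 70% of the mass onto "positive".
content_free = {"positive": 0.7, "negative": 0.3}
raw = {"positive": 0.6, "negative": 0.4}

calibrated = calibrate(raw, content_free)
raw_prediction = max(raw, key=raw.get)
calibrated_prediction = max(calibrated, key=calibrated.get)
```

Here the raw prediction is "positive", but the model is less confident in "positive" than it is on an empty query, so the calibrated prediction flips to "negative".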
Which demonstrations you pick matters alongside how you order them. Retrieving examples semantically similar to the query, rather than using one fixed static set, is a reliable lever: nearest-neighbour selection over an embedding index consistently beats random demonstrations, because the input distribution and format then match the query more tightly.
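A minimal sketch of nearest-neighbour selection follows. To stay self-contained it uses a bag-of-words `embed` function as a hypothetical stand-in for a real sentence-embedding model, and a linear scan instead of an index. The pool of support tickets is invented.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for a sentence-embedding model: word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_demonstrations(pool, query, k):
    """Return the k pool entries whose inputs are most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(pool, key=lambda demo: cosine(embed(demo[0]), query_vector), reverse=True)
    return ranked[:k]


pool = [
    ("refund my order please", "billing"),
    ("the app crashes on launch", "bug"),
    ("how do I change my password", "account"),
    ("I was charged twice for my order", "billing"),
    ("the app freezes when I upload a photo", "bug"),
]
chosen = select_demonstrations(pool, "the app crashes when I open a photo", k=2)
```

Both selected demonstrations come from the "bug" class, which tightens the input match but also illustrates a trade-off: similarity-based retrieval can produce class-imbalanced prompts, so it interacts with majority-label bias.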
A mechanistic aside: induction heads
Why does conditioning on examples work at all? One concrete mechanism has unusually direct evidence behind it. Olsson et al. (2022) at Anthropic argue that induction heads may account for the majority of in-context learning in transformers. An induction head implements a simple pattern-completion rule: having seen the sequence [A][B] earlier, when it later encounters [A] again it attends back and predicts [B], copying and completing patterns from earlier in the context. The evidence is strong for interpretability work: induction heads form during a sharp phase change in training that coincides with a jump in in-context learning ability, perturbing the architecture to shift when they form shifts the ICL improvement to match, and ablating them at test time reduces ICL. The authors are explicit that the case is causal for small attention-only models and more correlational for larger models with MLP layers. Real few-shot prompting is richer than literal copying, but the same machinery that copies [A][B] generalises to softer pattern completion, which is a plausible substrate for the whole phenomenon.
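The [A][B] ... [A] → [B] rule itself is simple enough to write down directly. The function below illustrates the input-output behaviour of the literal copying case only. It says nothing about how attention heads compute it, and the function name is an assumption.

```python
def induction_predict(tokens):
    """Predict the next token by the [A][B] ... [A] -> [B] rule, or None."""
    current = tokens[-1]
    # Scan backwards for the most recent earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy whatever followed it last time
    return None  # no earlier occurrence, so the rule has nothing to copy


completed = induction_predict(["the", "quick", "fox", "saw", "the"])
no_match = induction_predict(["a", "b", "c"])
```

In the first call the final "the" matches the earlier "the", so the rule returns "quick". In the second there is no earlier "c", so it returns `None`.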
ICL versus fine-tuning versus chain-of-thought
Three things get conflated; keep them separate.
| | In-context learning | Fine-tuning | Chain-of-thought |
|---|---|---|---|
| Changes weights? | No | Yes | No |
| Persists across calls? | No, lives in the prompt | Yes, baked in | No |
| What it does | Locates a task via examples | Fits new behaviour into parameters | Elicits intermediate reasoning steps |
| Marginal cost | Prompt tokens, every call | One-off training run | Extra output tokens, every call |
ICL and chain-of-thought are complementary, not alternatives: CoT is a style of demonstration or instruction ("show the reasoning before the answer") that you deliver through the very in-context channel ICL provides. Fine-tuning is the different beast; it is the tool when you need behaviour to persist, to survive a cleared context, to run cheaply at inference without re-paying the demonstration tokens, or to reach quality that prompting plateaus below.
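The point that CoT travels through the in-context channel can be seen by rendering the same demonstrations twice, once without and once with a reasoning line. The `Q:`/`Reasoning:`/`A:` template and the arithmetic questions are illustrative assumptions.

```python
def render_demo(question, answer, reasoning=None):
    """Render one demonstration, optionally with a worked reasoning line."""
    lines = [f"Q: {question}"]
    if reasoning is not None:
        lines.append(f"Reasoning: {reasoning}")
    lines.append(f"A: {answer}")
    return "\n".join(lines)


# Hypothetical demonstrations: (question, answer, reasoning).
demos = [
    ("A shelf holds 3 boxes of 4 books. How many books?", "12",
     "3 boxes times 4 books each is 12."),
    ("Tom had 10 apples and ate 3. How many remain?", "7",
     "10 minus 3 is 7."),
]
query = "A bus has 5 rows of 6 seats. How many seats?"

# Plain few-shot: the model is cued to answer immediately.
plain_prompt = "\n\n".join([render_demo(q, a) for q, a, _ in demos] + [f"Q: {query}\nA:"])
# Few-shot CoT: same channel, but the template cues reasoning first.
cot_prompt = "\n\n".join([render_demo(q, a, r) for q, a, r in demos] + [f"Q: {query}\nReasoning:"])
```

Nothing about the model or the delivery mechanism differs between the two prompts; only the demonstrated output shape does.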
When it falls down
- Brittleness to order and format. The same examples in a different order can swing accuracy from strong to chance (Lu et al.). Balance classes, calibrate, or search orderings; do not treat a single hand-written prompt as a stable measurement.
- It adds no new knowledge. ICL relocates capabilities already in the weights. If the pretrained model does not know a fact or cannot do an operation, no number of demonstrations will install it. That is fine-tuning's or retrieval's job.
- It burns context tokens on every call. Twenty demonstrations of a few hundred tokens each are re-sent and re-attended for every single query. At high enough call volume, fine-tuning the behaviour in once can be cheaper than paying the prompt tax indefinitely.
- It plateaus below fine-tuning on hard tasks. For tasks needing precise, consistent, novel behaviour, few-shot prompting hits a ceiling that a fine-tune clears. ICL is the fast first move, not always the final one.
- The correct-label caveat is task-dependent. Min et al. show labels can be corrupted with little damage on many classification tasks, but do not overgeneralise: on harder or more unusual tasks correct demonstrations matter more. Treat it as evidence that format and label space carry much of the load, not as permission to feed the model wrong answers.
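The prompt-tax argument above reduces to a break-even calculation. The sketch below uses invented prices and a deliberately crude model: it assumes the fine-tuned model costs the same per token at inference and needs no demonstrations at all, neither of which is guaranteed in practice.

```python
import math


def break_even_calls(num_demos, tokens_per_demo, price_per_million_tokens, fine_tune_cost):
    """Number of calls after which a one-off fine-tune beats re-sending demonstrations."""
    prompt_tax_per_call = num_demos * tokens_per_demo * price_per_million_tokens / 1_000_000
    return math.ceil(fine_tune_cost / prompt_tax_per_call)


# All figures are hypothetical and chosen only to show the arithmetic.
calls = break_even_calls(
    num_demos=20,
    tokens_per_demo=300,
    price_per_million_tokens=2.0,  # currency units per million input tokens
    fine_tune_cost=500.0,          # one-off training cost in the same units
)
```

With these numbers the demonstrations cost 0.012 per call, so the fine-tune pays for itself after 41,667 calls. Below that volume, staying with the prompt is the cheaper option.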
Further reading
- Language Models are Few-Shot Learners - Brown et al. (2020), the GPT-3 paper that established zero/one/few-shot ICL and its scaling behaviour.
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? - Min et al. (2022), the label-space/input-distribution/format finding and the random-label result.
- Fantastically Ordered Prompts and Where to Find Them - Lu et al. (2021), prompt-order sensitivity and an entropy-based method for finding good orderings.
- In-context Learning and Induction Heads - Olsson et al. (2022), the mechanistic case that induction heads drive much of ICL.