Sampling and Decoding

A language model never emits a token. At each step it emits a probability distribution over the entire vocabulary, and a separate piece of code, the decoder, chooses what to actually write. Swap that decoder and the same frozen weights swing from flat, looping boilerplate to vivid prose to confident nonsense. This stage gets far less attention than architecture or training, yet it is where a depressing share of "the model is bad at X" complaints actually live. Understanding it is the difference between tuning two numbers and re-running a fine-tune you did not need.

Greedy and beam search

The simplest decoder is greedy: take the argmax at every step. It is fast and deterministic, and for short, high-confidence outputs (a classification label, a function name) it is often exactly right. Its weakness is myopia. The locally most probable token can lead into a globally poor sentence, and there is no way back.
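A minimal sketch of greedy decoding, assuming a hypothetical toy scorer (`toy_scores` and its `TOY_SCORES` table are invented for illustration and stand in for a real model's forward pass). It shows the argmax loop and nothing more.

```python
from typing import Callable, Dict, List

# Hypothetical next-token scores keyed by the previous token only;
# a real model would condition on the whole prefix.
TOY_SCORES: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 2.0, "a": 1.5},
    "the": {"cat": 1.8, "dog": 1.7},
    "a": {"dog": 2.5, "cat": 0.1},
    "cat": {"</s>": 3.0},
    "dog": {"</s>": 3.0},
}


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Return scores for every candidate next token (toy stand-in)."""
    return TOY_SCORES[prefix[-1]]


def greedy_decode(
    score_fn: Callable[[List[str]], Dict[str, float]],
    max_steps: int = 10,
    bos: str = "<s>",
    eos: str = "</s>",
) -> List[str]:
    tokens = [bos]
    for _ in range(max_steps):
        scores = score_fn(tokens)
        best = max(scores, key=scores.get)  # argmax over the vocabulary
        tokens.append(best)                 # committed: no way to revise later
        if best == eos:
            break
    return tokens


print(greedy_decode(toy_scores))  # ['<s>', 'the', 'cat', '</s>']
```

The decoder commits to "the" because it wins the first step, and never learns that "a" would have led to a much stronger continuation.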

Beam search keeps the k most probable partial sequences alive and expands all of them, returning the highest-probability full sequence at the end. It dominates in machine translation and other tasks with one mostly-correct answer. For open-ended generation it backfires: maximising sequence probability drives the model toward bland, repetitive, high-likelihood text, the exact degeneration that makes beam output read like a hostage note. High probability is not the same as high quality, and on creative tasks the two actively diverge.
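To make the mechanics concrete, here is a small beam search sketch over a hypothetical table of next-token probabilities (`NEXT_PROBS` is invented for illustration). It omits length normalisation and other refinements that production implementations usually add.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical next-token probabilities keyed by the previous token.
NEXT_PROBS: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

Hypothesis = Tuple[float, List[str]]  # (cumulative log-probability, tokens)


def beam_search(
    next_probs: Dict[str, Dict[str, float]],
    beam_width: int = 2,
    max_steps: int = 10,
    bos: str = "<s>",
    eos: str = "</s>",
) -> Hypothesis:
    beams: List[Hypothesis] = [(0.0, [bos])]
    finished: List[Hypothesis] = []
    for _ in range(max_steps):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for logp, tokens in beams:
            for tok, p in next_probs[tokens[-1]].items():
                candidates.append((logp + math.log(p), tokens + [tok]))
        # Keep only the beam_width most probable expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            if cand[1][-1] == eos:
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if max_steps ran out
    return max(finished, key=lambda c: c[0])


print(beam_search(NEXT_PROBS, beam_width=1)[1])  # ['<s>', 'the', 'cat', '</s>']
print(beam_search(NEXT_PROBS, beam_width=2)[1])  # ['<s>', 'a', 'dog', '</s>']
```

With a beam width of 1 the search reduces to greedy and returns the 0.33-probability sequence; with a width of 2 it keeps "a" alive long enough to find the 0.36-probability one.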

Temperature: reshape before you sample

Once you decide to sample rather than maximise, temperature T rescales the logits before the softmax: divide every logit by T, then normalise.

p_i = exp(logit_i / T) / sum_j exp(logit_j / T)

T < 1 sharpens the distribution toward the top tokens (more deterministic, more conservative); T > 1 flattens it (more diverse, more risk); T -> 0 collapses to greedy. Temperature alone is a poor safety mechanism, though. Raise it for diversity and you also inflate the long tail of genuinely bad tokens, so you get creativity and incoherence together. That tension is why temperature is almost always paired with a truncation step.
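The effect of T is easy to see numerically. The sketch below is a plain-Python temperature softmax; the function name and the example logits are illustrative assumptions, not taken from any particular library.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    if temperature <= 0:
        # T -> 0 is the greedy limit; handle it with argmax rather than dividing by zero.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as T rises, while the probability of the weakest token grows: the same knob that adds diversity also feeds the tail.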

Truncation: cut the tail, then sample

Method	Keeps	Behaviour
top-k	the `k` highest-probability tokens	fixed-size shortlist; too rigid when the distribution is very peaked or very flat
top-p (nucleus)	the smallest set whose cumulative probability exceeds `p`	adapts its size to model confidence; the long-standing default
min-p	tokens with probability at least `min_p * p_max`	threshold scales with the top token; stays sane at high temperature

Top-k keeps a fixed number of candidates and samples among them. It is simple but mis-sized by construction: when the model is certain, k=50 drags in 49 bad options; when it is unsure, k=50 may chop off good ones. Nucleus sampling (top-p), introduced in "The Curious Case of Neural Text Degeneration", fixes the size problem by keeping the smallest set of tokens whose probabilities sum past p (say 0.9), so the shortlist grows when the model is uncertain and shrinks when it is confident. Min-p is the more recent refinement: it sets a floor relative to the most likely token, min_p * p_max, which keeps the candidate pool coherent even at high temperature, where top-p can let improbable tokens slip in. The practical upshot is that min-p lets you crank temperature for genuine creativity without the output falling apart.
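The three filters can be sketched in a few lines each. This is an illustrative version operating on a token-to-probability dict; the function names and the two example distributions are assumptions made for the demonstration, and real implementations work on logit tensors and differ in details such as whether the nucleus boundary is inclusive.

```python
import random
from typing import Dict


def renormalize(probs: Dict[str, float]) -> Dict[str, float]:
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(dict(ranked[:k]))


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:  # smallest prefix whose mass reaches p
            break
    return renormalize(kept)


def min_p_filter(probs: Dict[str, float], min_p: float) -> Dict[str, float]:
    threshold = min_p * max(probs.values())  # floor scales with the top token
    return renormalize({tok: q for tok, q in probs.items() if q >= threshold})


def sample(probs: Dict[str, float], rng: random.Random) -> str:
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Illustrative distributions: one confident, one uncertain.
confident = {"Paris": 0.92, "Lyon": 0.04, "Nice": 0.03, "Rome": 0.01}
unsure = {"red": 0.28, "blue": 0.24, "green": 0.20, "gold": 0.16, "grey": 0.12}

for name, dist in (("confident", confident), ("unsure", unsure)):
    print(
        name,
        len(top_k_filter(dist, 3)),    # always 3
        len(top_p_filter(dist, 0.9)),  # 1 when confident, 5 when unsure
        len(min_p_filter(dist, 0.1)),  # 1 when confident, 5 when unsure
    )

print(sample(min_p_filter(confident, 0.1), random.Random(0)))  # Paris
```

Top-k returns three candidates in both cases, while top-p and min-p shrink to a single token for the confident distribution and widen to all five for the uncertain one.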

Repetition control and structure

Even well-truncated sampling can loop ("the the the", restated sentences). Frequency and presence penalties subtract from the logits of tokens already seen, weighted by count or mere presence, to break the cycle; the multiplicative repetition penalty scales those logits down instead. Set them too high and the model starts avoiding necessary words (articles, a subject's name) and the text turns evasive. For tasks that demand a rigid output shape, constrained decoding masks the logits to only those tokens a grammar permits, guaranteeing valid JSON or a value from a fixed set; that is a deeper topic in its own right, on the serving side.
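Both ideas are simple logit edits applied before sampling. The sketch below assumes one common subtractive formulation (count times a frequency penalty, plus a flat presence penalty) and a set-based mask standing in for a grammar; the function names and example values are hypothetical, and serving stacks vary in the exact formula.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def apply_penalties(
    logits: Dict[str, float],
    generated: List[str],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
) -> Dict[str, float]:
    """Subtract a count-weighted and a flat penalty from tokens already emitted."""
    adjusted = dict(logits)
    for tok, count in Counter(generated).items():
        if tok in adjusted:
            adjusted[tok] -= count * frequency_penalty + presence_penalty
    return adjusted


def mask_to_allowed(logits: Dict[str, float], allowed: Set[str]) -> Dict[str, float]:
    """Constrained decoding in miniature: disallowed tokens get -inf."""
    return {tok: (s if tok in allowed else -math.inf) for tok, s in logits.items()}


# Illustrative values: the model wants to say "the" a third time.
logits = {"the": 3.0, "cat": 2.0, "sat": 1.5}
history = ["the", "the", "cat"]
adjusted = apply_penalties(logits, history, frequency_penalty=0.8, presence_penalty=0.2)
print(max(adjusted, key=adjusted.get))  # sat

# A field that must be "true" or "false", whatever else the model prefers.
masked = mask_to_allowed({"true": 0.2, "false": 0.1, "maybe": 5.0}, {"true", "false"})
print(max(masked, key=masked.get))  # true
```

The penalty only tilts the odds, which is why it can misfire on words that must repeat; the mask is absolute, which is why it can guarantee a valid value.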

When it falls down

Reasoning models want low temperature. For chain-of-thought and code, sampling noise compounds across a long generation; reasoning-tuned models are typically run near-greedy. High temperature here trades correctness for nothing. Self-consistency deliberately re-introduces sampling, drawing several diverse chains and voting, which is the exception that proves the rule.
Penalties versus correctness. Repetition penalties punish any reuse, including legitimately repeated identifiers in code or a name that must recur. The fix is usually a smaller penalty, not a larger one.
Defaults travel badly. A temperature and top-p tuned for chat will throttle a brainstorming task and destabilise a structured-extraction one. Decoding settings are task settings, not global ones.
Determinism is not guaranteed. Even greedy decoding can vary across hardware and batch sizes because floating-point reductions reorder (see numerical-computation-gotchas). "Temperature 0" reduces variance; it does not promise bit-identical output.
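The self-consistency idea mentioned in the first item can be sketched as a sample-and-vote loop. Here `noisy_solver` is a hypothetical stand-in for "run one sampled reasoning chain and extract its final answer"; its answers and its 70% accuracy are invented for illustration.

```python
import random
from collections import Counter
from typing import Callable, Tuple


def self_consistency(
    sample_answer: Callable[[random.Random], str],
    n_samples: int,
    rng: random.Random,
) -> Tuple[str, float]:
    """Draw several sampled answers and return the majority answer with its vote share."""
    answers = [sample_answer(rng) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


def noisy_solver(rng: random.Random) -> str:
    # Hypothetical: right 70% of the time, otherwise one of two wrong answers.
    return "42" if rng.random() < 0.7 else rng.choice(["41", "43"])


answer, share = self_consistency(noisy_solver, n_samples=25, rng=random.Random(0))
print(answer, round(share, 2))
```

Any single chain here is wrong about three times in ten, but the vote is wrong far less often, which is the sense in which sampling helps a reasoning task only when it is aggregated.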
