← Concept library

Vision & Multimodal

Beam Search Decoding in ASR

Beam search decoding navigates the exponentially large label sequence space in ASR by maintaining a fixed-width frontier of the most probable partial hypotheses at each step, making it the default inference strategy for CTC, RNN-T, and attention-based models.

intermediate · 8 min read

Picking the single most probable character at each timestep - greedy decoding - gets you to perhaps 80-90% of what a model can theoretically achieve. The remaining gap, which can be several percentage points of word error rate on challenging test sets, comes from sequences that start with a locally unlikely token but end up globally more probable. Beam search exists to close that gap without enumerating every possible transcript.

The core idea: keeping a shortlist of live hypotheses

A greedy decoder commits irrevocably at every step. Beam search instead keeps the B most probable partial hypotheses - the beam - and expands each one at every timestep, scoring all possible extensions, then pruning back to B survivors.

Concretely, at step t you have B hypotheses each carrying a log-probability score. Each hypothesis is extended by every token in the vocabulary V, producing up to B × V candidates. You sort by score and keep the top B. The beam width B controls the compute-quality trade-off: B = 1 reduces to greedy; B = ∞ recovers exact search (impractical for real vocabularies).

For a vocabulary of size V over T timesteps, the true search space is V^T. Beam search explores B × V × T nodes: a linear reduction at the cost of a heuristic bound.

initialise beam = { (log_prob=0.0, hypothesis=[]) }
for each timestep t = 1..T:
    candidates = []
    for each (score, hyp) in beam:
        for each token y in V:
            new_score = score + log P(y | hyp, acoustic_features[t])
            candidates.append( (new_score, hyp + [y]) )
    beam = top-B(candidates)  # prune
return argmax(score) over beam

The exact semantics of log P(y | hyp, acoustic_features[t]) differ by model family, which is where things get interesting.

Connectionist Temporal Classification (CTC) emits a label or blank at every acoustic frame. A naive beam over frame-level outputs would conflate distinct alignments that collapse to the same label sequence - "aa_b" and "a_ab" both decode to "ab" (using underscore for blank). Prefix beam search, introduced alongside the deep speech line of work, merges these into a single hypothesis and accumulates their probabilities:

\[p(\mathbf{y}, t) = p_{\text{blank}}(\mathbf{y}, t) + p_{\text{non-blank}}(\mathbf{y}, t)\]

At each frame, for each prefix in the beam, you track two quantities: the probability that the prefix ends with a blank, and the probability that it ends with a non-blank label. Extending the prefix requires careful case analysis (extending with the same label vs. a new label vs. blank) to avoid double-counting alignments. This book-keeping is the distinguishing algorithmic feature of CTC decoding versus vanilla beam search.

Language model shallow fusion

A purely acoustic beam search will make errors on rare words and proper nouns. Shallow fusion splices in a separately trained language model by combining scores at each step:

\[\log P_{\text{combined}}(y_t \mid \mathbf{y}_{<t}) = \log P_{\text{acoustic}}(y_t \mid \mathbf{y}_{<t}, \mathbf{x}) + \lambda \log P_{\text{LM}}(y_t \mid \mathbf{y}_{<t})\]

The weight lambda (typically 0.1 to 0.5) is tuned on a validation set. When the acoustic model emits subword units (BPE or wordpiece tokens), the LM must operate at the same granularity; using a word-level LM requires accumulating subword scores until a word boundary is reached before applying the LM bonus.

Hannun et al. (2019) report that coupling a beam search decoder with a convolutional language model yields more than 22% relative word error rate reduction over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set, illustrating how much a well-integrated LM contributes beyond the acoustic signal alone.

Decoding strategy LM integration Typical WER reduction vs. greedy
Greedy None baseline
Beam (no LM) None 1-3% relative
Beam + shallow fusion External n-gram or neural LM 5-25% relative
Beam + deep fusion LM hidden state fused into decoder Similar to shallow; more complex

Beam search for attention-based and RNN-T models

Attention encoder-decoder models (such as Listen, Attend and Spell and Whisper) generate one output token per decoding step rather than one per acoustic frame. Standard beam search applies directly: the decoder cross-attends to the full encoder output and predicts the next token. The beam width used in practice is small (Whisper defaults to B = 5), but even this modest width meaningfully improves over greedy on out-of-domain speech.

RNN-Transducer (RNN-T) decoding is harder because the model factorises over both time and label dimensions. The output network runs at every (time, label) pair, and the beam must be maintained jointly. Practitioners often use a simplified "alignment-restricted" beam that constrains how far ahead in time a hypothesis can advance, reducing compute while preserving most of the quality gain.

When it falls down

Length bias. Log-probability scores accumulate over time, so longer hypotheses naturally receive lower (more negative) scores. Length normalisation - dividing by the number of tokens - partially corrects this but can over-penalise short transcripts in noisy conditions.

Beam diversity collapse. With small beam widths and a high-confidence acoustic model, all B hypotheses often converge to the same prefix within a few steps. Diverse beam search techniques (grouping hypotheses, applying penalties for repeated tokens) exist but add complexity.

Mismatch at word boundaries in subword models. Shallow fusion with a word-level LM requires waiting until a full word is assembled from subword tokens before applying the LM score. This introduces latency and can cause pruning to discard correct subword prefixes before the full word is scored.

Streaming constraints. Standard beam search is non-causal: it can revisit and rescore frames. For streaming ASR (live transcription with low latency), the beam must operate on a fixed lookahead window. CTC with a short lookahead can still run beam search, but attention models require chunk-based processing that complicates the beam state.

Computational cost scaling. Cost grows as O(B × V × T). For character-level models V is small (~30-100), making beam search cheap. For large vocabulary word-level systems V can be tens of thousands; practical decoders use lexicon constraints and hash-based prefix trees to avoid scoring every token at every step.

Score miscalibration. If the acoustic model is overconfident (common in models fine-tuned on narrow domains), the LM weight lambda found on general validation data may be too low, and shallow fusion gains little. Calibration or temperature scaling before decoding helps.

Further reading

Sign in to save and react.
Share Copied