Language-Model Fusion

End-to-end ASR models - RNN-T, attention encoder-decoder, CTC - are trained on paired audio-text data. There is rarely enough of it. A well-trained language model, by contrast, has seen orders of magnitude more text. The question is not whether to use that LM; the question is how to combine two probability distributions that were never optimised against each other without letting one drown out the other.

The Shallow Fusion Baseline

Shallow fusion is the simplest possible answer: at beam search time, add the log-probability of an external LM to the log-probability of the acoustic model, weighted by a tunable scalar.

score(y | x) = log P_AM(y | x) + λ · log P_LM(y)

P_AM is the acoustic model score (attention, RNN-T, or CTC output); P_LM is the LM score; λ is found by grid search on a held-out set, typically in the range 0.1 to 0.4.
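As a rough illustration of the formula above, here is a minimal sketch that rescores a fixed n-best list rather than fusing scores inside a live beam search. The `Hypothesis` class, the helper names, the toy log-probabilities, and the use of sentence-level errors as the tuning metric are all assumptions made for the example; a real system would usually tune λ against WER.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    am_logp: float  # log P_AM(y | x), hypothetical value from the ASR model
    lm_logp: float  # log P_LM(y), hypothetical value from the external LM


def shallow_fusion_score(hyp, lam):
    # score(y | x) = log P_AM(y | x) + lambda * log P_LM(y)
    return hyp.am_logp + lam * hyp.lm_logp


def best_hypothesis(hyps, lam):
    # Pick the candidate with the highest fused score.
    return max(hyps, key=lambda h: shallow_fusion_score(h, lam))


def grid_search_lambda(dev_set, grid):
    # dev_set: list of (hypotheses, reference_text) pairs from a held-out set.
    # Counts utterances whose top hypothesis differs from the reference
    # (a simplification of WER, assumed here for brevity).
    def errors(lam):
        return sum(best_hypothesis(hyps, lam).text != ref for hyps, ref in dev_set)

    # min() returns the first grid value that reaches the lowest error count.
    return min(grid, key=errors)


# Made-up scores: the AM slightly prefers the wrong string,
# the LM strongly prefers the right one.
hyps = [
    Hypothesis("recognise speech", am_logp=-4.0, lm_logp=-6.0),
    Hypothesis("wreck a nice beach", am_logp=-3.8, lm_logp=-12.0),
]
dev_set = [(hyps, "recognise speech")]

no_lm_pick = best_hypothesis(hyps, lam=0.0).text
fused_pick = best_hypothesis(hyps, lam=0.3).text
tuned_lam = grid_search_lambda(dev_set, grid=[0.0, 0.1, 0.2, 0.3, 0.4])
```

With these toy numbers the AM alone picks the acoustically closer but implausible string, while any positive λ on the grid flips the decision. In practice the same interpolation is applied to partial hypotheses at each beam search step.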

This works surprisingly well. The 2018 Toshniwal et al. comparison study found shallow fusion to be the strongest single-pass method across both Switchboard and large-scale Google voice-search data. No retraining required. You can swap LMs without touching the AM. The weakness is that it treats acoustic and language scores as independent, which they are not - the AM already learned something about word sequences from its training transcripts.

The Internal Language Model Problem

Every seq2seq or transducer model trained on paired data develops an implicit language model inside itself. The decoder has seen the training transcripts and has absorbed their n-gram statistics. When you add an external LM on top, you are effectively double-counting that prior: once from the AM decoder, once from the LM.

For in-domain data the double-counting is harmless or even helpful. For out-of-domain data (a system trained on medical dictation being used to transcribe legal text, for instance) the internal prior actively fights the new LM. The net effect can be worse than no LM at all.

Internal Language Model Estimation (ILME), introduced by Meng et al. (2021), fixes this by subtracting the internal LM score:

score(y | x) = log P_AM(y | x)
             + λ · log P_LM(y)
             - μ · log P_ILM(y)

P_ILM(y) is estimated by running the AM decoder with a zeroed or masked acoustic input - the model then conditions only on its own previous predictions, revealing the language prior it has learned. Subtracting this (with weight μ) cancels the double-counting before the external LM is added. In Meng et al.'s experiments on models trained on 30,000 hours of Microsoft speech, ILME reduced WER by up to 15.5% relative versus standard shallow fusion on out-of-domain LibriSpeech data, with no additional training required.
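The sketch below illustrates both halves of this idea under strong simplifying assumptions. The toy decoder step, whose logits are just the sum of an acoustic term and a label-history term, is a stand-in invented for the example and is not the architecture used by Meng et al.; the candidate strings and their log-probabilities are likewise made up.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def step_logprobs(acoustic_logits, history_logits):
    # Toy decoder step (assumption): logits = acoustic term + label-history term.
    return log_softmax([a + h for a, h in zip(acoustic_logits, history_logits)])


def ilm_step_logprobs(history_logits):
    # ILM estimate for one step: the same computation with the acoustic term zeroed,
    # so the distribution depends only on the previous labels.
    return step_logprobs([0.0] * len(history_logits), history_logits)


def ilme_score(am_logp, lm_logp, ilm_logp, lam, mu):
    # score(y | x) = log P_AM(y | x) + lambda * log P_LM(y) - mu * log P_ILM(y)
    return am_logp + lam * lm_logp - mu * ilm_logp


# (am_logp, lm_logp, ilm_logp) per hypothesis; made-up numbers for a model
# trained on medical speech and decoded with a legal-domain external LM.
candidates = {
    "the patient was discharged": (-5.0, -14.0, -4.0),
    "the plaintiff was discharged": (-6.0, -8.0, -9.0),
}


def pick(lam, mu):
    return max(candidates, key=lambda y: ilme_score(*candidates[y], lam, mu))


shallow_fusion_pick = pick(lam=0.1, mu=0.0)  # mu = 0 reduces to shallow fusion
ilme_pick = pick(lam=0.1, mu=0.2)
ilm_probs = [math.exp(v) for v in ilm_step_logprobs([2.0, 0.5, -1.0])]
```

In this toy case, shallow fusion alone still selects the source-domain phrase because the internal prior is baked into the AM score. Subtracting the estimated ILM term removes that advantage and lets the external LM decide. Both λ and μ would be tuned on held-out data in a real setup.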

The follow-up work (Meng et al., ICASSP 2021) went further: during training, an ILM loss is minimised alongside the usual end-to-end loss to make the internal prior more explicit and easier to subtract, yielding up to 31.5% relative WER reduction over standard training with shallow fusion. The idea is to make the implicit thing explicit so you can cancel it cleanly.
