Language-Model Fusion

Language-model fusion techniques inject text-only knowledge into end-to-end ASR models at inference time or training time, and the correct method depends on how much implicit language bias the acoustic model has already absorbed.

End-to-end ASR models (RNN-T, attention encoder-decoder, CTC) are trained on paired audio-text data, and there is rarely enough of it. A well-trained language model, by contrast, has seen orders of magnitude more text. The question is not whether to use that LM; the question is how to combine two probability distributions that were never optimised against each other without letting one drown out the other.

The Shallow Fusion Baseline

Shallow fusion is the simplest possible answer: at beam search time, add the log-probability of an external LM to the log-probability of the acoustic model, weighted by a tunable scalar.
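As a rough illustration, the fused score for a candidate token can be written as log p_AM(y | x) + λ · log p_LM(y), where λ is the tunable scalar. The sketch below applies that rule to a single beam-search expansion step. The names `shallow_fusion_step`, `toy_am`, `toy_lm`, and the weight of 0.3 are hypothetical and chosen only for this example; a real decoder would query actual acoustic and language models and would typically tune λ on held-out data.

```python
import math

def shallow_fusion_step(beams, am_log_probs, lm_log_probs, lm_weight=0.3, beam_size=2):
    """Extend every hypothesis by one token using AM + weighted LM scores.

    beams: list of (tokens, score) pairs, where tokens is a tuple of strings.
    am_log_probs, lm_log_probs: hypothetical callables that map a token
        prefix to a dict {token: log_probability}.
    lm_weight: the tunable scalar applied to the LM log-probability.
    """
    candidates = []
    for tokens, score in beams:
        am = am_log_probs(tokens)
        lm = lm_log_probs(tokens)
        for tok, am_lp in am.items():
            # Shallow fusion: add the weighted LM log-prob to the AM log-prob.
            fused = am_lp + lm_weight * lm[tok]
            candidates.append((tokens + (tok,), score + fused))
    # Keep the highest-scoring hypotheses.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]

# Toy models (assumptions for illustration): the acoustic model slightly
# prefers "there", while the text-only LM strongly prefers "their".
def toy_am(prefix):
    return {"there": math.log(0.55), "their": math.log(0.45)}

def toy_lm(prefix):
    return {"there": math.log(0.10), "their": math.log(0.90)}

start = [((), 0.0)]
with_lm = shallow_fusion_step(start, toy_am, toy_lm, lm_weight=0.3, beam_size=1)
without_lm = shallow_fusion_step(start, toy_am, toy_lm, lm_weight=0.0, beam_size=1)
print(with_lm[0][0], without_lm[0][0])
```

In this toy setup the LM weight is enough to flip the top hypothesis from the acoustically preferred token to the one the LM favours; setting the weight to zero recovers the acoustic model's own ranking.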
