The RNN-Transducer

The RNN-Transducer is a fully neural, streaming-capable sequence transduction model that replaces CTC's conditional independence assumption with a learned label-context network, enabling accurate on-device speech recognition.

CTC made end-to-end ASR practical, but it carries a silent flaw: at every output step, the model assumes the emitted labels are conditionally independent of one another given the acoustics. Speak the phrase "New York" and CTC must predict "York" from the audio alone, blind to the "New" it just output. That limitation is especially painful for morphologically rich languages and for rare proper nouns, where language-model context is the only signal distinguishing likely completions.
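The independence assumption is easiest to see in the factorization itself. The block below is a notational sketch, not a quotation from any paper: x is the acoustic input of T frames, y is the label sequence of length U, and A_CTC(y) and A_RNN-T(y) are assumed names for the sets of alignments (labels plus blanks) that collapse to y under each model's rules. In the second line, t_i is the frame being read at alignment step i and u_i is the number of labels already emitted.

```latex
% CTC: one symbol per frame; every factor conditions on the acoustics only.
P_{\mathrm{CTC}}(\mathbf{y} \mid \mathbf{x})
  = \sum_{\pi \in \mathcal{A}_{\mathrm{CTC}}(\mathbf{y})}
    \prod_{t=1}^{T} P(\pi_t \mid \mathbf{x})

% Transducer: every factor also conditions on the labels emitted so far.
P_{\mathrm{RNN\text{-}T}}(\mathbf{y} \mid \mathbf{x})
  = \sum_{a \in \mathcal{A}_{\mathrm{RNN\text{-}T}}(\mathbf{y})}
    \prod_{i=1}^{T+U} P(a_i \mid \mathbf{x},\, t_i,\, y_{1:u_i})
```

The only structural difference is the extra conditioning term y_{1:u_i}: in the "New York" example, it is what lets the factor that emits "York" see that "New" came before it.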

Alex Graves's 2012 paper introduced the RNN-Transducer (RNN-T) to close that gap. The key insight is simple to state: give the output predictor its own recurrent network that reads the label history, and make the entire system differentiable end-to-end. The practical consequence is that a single neural model can do acoustic modelling and language modelling jointly, with no handcrafted lexicon, no separate LM, and, provided the encoder is unidirectional, no requirement to see the whole utterance before emitting the first token. That last property, streaming, is what took RNN-T from a research curiosity in 2012 to a leading on-device ASR engine by 2019, powering speech input on Google's Pixel phones.
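To make that description concrete, here is a minimal sketch of the idea in plain Python. Everything in it is an assumption made for illustration: the class `ToyTransducer`, its method names, the toy sizes, the plain tanh recurrences, the additive joint, and the random weights standing in for trained parameters. It shows only the control flow: an encoder that reads audio frames one at a time, a second recurrent network that reads the label history, and a joint step that combines the two before each emission.

```python
import math
import random

BLANK = 0        # blank symbol: "emit nothing, move on to the next frame"
VOCAB_SIZE = 4   # blank plus three labels (toy size, an assumption)
HIDDEN = 5       # width shared by all toy networks (an assumption)


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def rnn_step(w_in, w_rec, inputs, state):
    """One step of a plain tanh RNN: tanh(W_in @ inputs + W_rec @ state)."""
    mixed = zip(matvec(w_in, inputs), matvec(w_rec, state))
    return [math.tanh(a + b) for a, b in mixed]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyTransducer:
    """Hypothetical toy model; not a reference implementation of RNN-T."""

    def __init__(self, feat_dim, seed=0):
        rng = random.Random(seed)
        self.enc_in = make_matrix(HIDDEN, feat_dim, rng)
        self.enc_rec = make_matrix(HIDDEN, HIDDEN, rng)
        self.pred_in = make_matrix(HIDDEN, VOCAB_SIZE, rng)
        self.pred_rec = make_matrix(HIDDEN, HIDDEN, rng)
        self.out = make_matrix(VOCAB_SIZE, HIDDEN, rng)

    def encode_frame(self, frame, state):
        # Unidirectional: uses only the current frame and past state.
        return rnn_step(self.enc_in, self.enc_rec, frame, state)

    def read_label(self, label, state):
        # The label-history network consumes the label just emitted.
        one_hot = [1.0 if i == label else 0.0 for i in range(VOCAB_SIZE)]
        return rnn_step(self.pred_in, self.pred_rec, one_hot, state)

    def joint(self, enc_state, pred_state):
        # Combine acoustics and label history, then normalise over the vocabulary.
        fused = [math.tanh(a + b) for a, b in zip(enc_state, pred_state)]
        return softmax(matvec(self.out, fused))

    def greedy_stream(self, frames, max_labels_per_frame=3):
        """Yield labels as frames arrive; never looks at future frames."""
        enc_state = [0.0] * HIDDEN
        pred_state = [0.0] * HIDDEN  # empty label history
        for frame in frames:
            enc_state = self.encode_frame(frame, enc_state)
            for _ in range(max_labels_per_frame):
                probs = self.joint(enc_state, pred_state)
                label = max(range(VOCAB_SIZE), key=probs.__getitem__)
                if label == BLANK:
                    break  # nothing more to say for this frame
                yield label
                pred_state = self.read_label(label, pred_state)


model = ToyTransducer(feat_dim=3)
frames = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5], [0.7, 0.1, 0.1]]
labels = list(model.greedy_stream(frames))
```

Because the weights are random, the emitted labels are meaningless; the point is the structure. The generator can yield a label after the very first frame, and the distribution returned by `joint` changes whenever `pred_state` does, which is exactly the dependence on label history that CTC lacks. The cap `max_labels_per_frame` is a safeguard added in this sketch so an untrained model cannot loop forever on one frame.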

The three networks and what they do
