The RNN-Transducer

CTC made end-to-end ASR practical, but it carries a built-in limitation: it assumes the output labels are conditionally independent of one another given the acoustics. Say the phrase "New York" and CTC must predict "York" from the audio alone, with no explicit dependence on the "New" it just emitted. That limitation is especially painful for morphologically rich languages and for rare proper nouns, where language-model context is the only signal distinguishing likely completions.

Alex Graves's 2012 paper introduced the RNN-Transducer (RNN-T) to close that gap. The key insight is simple to state: give the output predictor its own recurrent network that reads the label history, and make the entire system differentiable end-to-end. The practical consequence is that a single neural model can do acoustic modelling and language modelling jointly, with no handcrafted lexicon, no separate LM, and no requirement to see the whole utterance before emitting the first token. That last property - streaming - is what took RNN-T from a research curiosity in 2012 to the dominant on-device ASR engine by 2019, powering Google's Pixel speech stack.

The three networks and what they do

RNN-T is composed of three learnable modules:

Encoder (also called the transcription network). Takes the acoustic feature sequence \(x_1, \dots, x_T\) (typically 80-dim log-mel frames) and produces a frame-level encoding \(h^{enc}_t\). Any sequence model works here: LSTM, Transformer, or Conformer. For streaming, the encoder must be causal, seeing only past and current frames; full-context encoders are used when latency allows.

Prediction network (also called the label encoder). A recurrent net that reads the non-blank labels emitted so far, \(y_1, \dots, y_u\) (just a start symbol when \(u = 0\)), and produces \(h^{pred}_u\). Blank emissions never enter it. This is the component CTC lacks. It gives the model a learned prior over what label is likely to follow, functioning roughly as an implicit language model.

Joiner (the joint network). Combines one encoder state and one predictor state into a distribution over the output vocabulary plus a special blank token:

\[P(k \mid t, u) = \text{softmax}\bigl(W \cdot \tanh(h^{enc}_t + h^{pred}_u)\bigr)\]

where \(k \in \{\text{blank}, y_1, \dots, y_V\}\).
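The joiner formula above can be sketched for a single lattice node in plain Python. This is a toy illustration, not a production layer: it assumes \(h^{enc}_t\) and \(h^{pred}_u\) have already been projected to a common dimension, that row 0 of \(W\) corresponds to blank, and the function name `joiner` and the numbers are made up for the example.

```python
import math

def joiner(h_enc, h_pred, W):
    """Return P(k | t, u) for one lattice node.

    h_enc, h_pred: lists of floats of equal length D (assumed to be
    already projected to a shared dimension).
    W: list of V+1 rows of length D; row 0 is assumed to be blank.
    """
    # tanh(h_enc + h_pred), element-wise
    hidden = [math.tanh(a + b) for a, b in zip(h_enc, h_pred)]
    # W . hidden gives one logit per output token
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W]
    # numerically stable softmax
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes: D = 3, two labels plus blank.
h_enc = [0.5, -1.0, 0.25]
h_pred = [0.1, 0.3, -0.2]
W = [[0.2, -0.1, 0.4],
     [1.0, 0.5, -0.3],
     [-0.7, 0.8, 0.1]]
probs = joiner(h_enc, h_pred, W)
```

In a real model this is evaluated for every \((t, u)\) pair at once by broadcasting the two state sequences against each other, which is where the training cost discussed later comes from.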

The full model is therefore best thought of as operating on a 2-D lattice. One axis indexes acoustic time \(t\) (1 to \(T\)), the other indexes label position \(u\) (0 to \(U\)), the number of labels emitted so far. At every lattice node \((t, u)\), the model either emits a label (advancing \(u\), keeping \(t\) fixed) or emits blank (advancing \(t\), keeping \(u\) fixed). A valid path starts at \((1, 0)\) and ends with a blank emitted from \((T, U)\). The output is the sequence of non-blank labels along the path taken, and many different paths yield the same label sequence.
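The lattice walk can be made concrete with a greedy decoder. The sketch below is an assumption-laden toy: `step_fn` is a hypothetical stand-in for taking the argmax of \(P(k \mid t, u)\), blank is assumed to have id 0, and the cap on labels per frame is a common safeguard added here, not something the description above requires.

```python
BLANK = 0  # assumed id of the blank token

def greedy_decode(step_fn, T, max_symbols_per_frame=10):
    """Walk the lattice greedily, starting at (t=1, u=0).

    step_fn(t, labels) -> token id; a hypothetical stand-in for
    argmax_k P(k | t, u), where labels holds the u labels emitted so far.
    Returns the emitted labels and the visited (t, u, token) path.
    """
    t, labels, path = 1, [], []
    emitted_here = 0  # labels emitted at the current frame
    while t <= T:
        if emitted_here < max_symbols_per_frame:
            k = step_fn(t, labels)
        else:
            k = BLANK  # forced move to the next frame
        path.append((t, len(labels), k))
        if k == BLANK:
            t += 1              # blank: advance time, keep u
            emitted_here = 0
        else:
            labels.append(k)    # label: advance u, keep t
            emitted_here += 1
    return labels, path

# Scripted toy "model": emit these tokens at these nodes, blank elsewhere.
script = {(1, 0): 5, (2, 1): 7, (2, 2): 9}
step = lambda t, labels: script.get((t, len(labels)), BLANK)
labels, path = greedy_decode(step, T=4)
```

With this script the walk emits one label at frame 1 and two at frame 2, so the path has \(T + U = 7\) steps: three label emissions and four blanks.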

Training: the RNN-T loss and why it is expensive

Like CTC, RNN-T training marginalises over all valid alignments. The probability assigned to target sequence \(y^*\) is:

\[P(y^* \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y^*)} \prod_{(t,u)} P(\pi_{t,u} \mid t, u)\]

where \(\mathcal{B}\) removes the blank tokens from an alignment \(\pi\) to recover the label sequence, and the product runs over the lattice nodes \((t, u)\) that \(\pi\) visits. The forward-backward algorithm computes this sum over the full \(T \times U\) grid, so a straightforward implementation materialises the joiner output, a tensor of shape \((B, T, U, V)\), in GPU memory (strictly \(U+1\) and \(V+1\), counting the empty-history position and blank). Here \(B\) is batch size, \(T\) can be 1000+ frames, \(U\) can be 100+ labels, and \(V\) is the vocabulary size. For a batch of 32 utterances with \(T=500\), \(U=60\), \(V=4096\) in float32, that is roughly 15.7 GB per batch for that one tensor - a genuine bottleneck that kept RNN-T out of large-batch training for years.
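Both points in this section can be checked with a few lines of Python. The sketch below first reproduces the memory arithmetic, then runs the forward half of forward-backward in log space for a single utterance. It assumes blank has id 0, uses 0-based frame indices, and takes a precomputed nested list `log_probs[t][u][k]` of joiner log-probabilities; the function names are illustrative and do not come from any library.

```python
import math

BLANK = 0  # assumed id of the blank token
NEG_INF = float("-inf")

def joint_tensor_bytes(B, T, U, V, bytes_per_elem=4):
    """Size of a (B, T, U, V) tensor; float32 by default."""
    return B * T * U * V * bytes_per_elem

def logaddexp(a, b):
    """log(exp(a) + exp(b)) without overflow."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rnnt_log_likelihood(log_probs, target):
    """log P(target | x), summed over all alignments.

    log_probs[t][u][k]: log P(k | t, u) for t in 0..T-1, u in 0..U.
    target: list of U non-blank label ids.
    """
    T, U = len(log_probs), len(target)
    # alpha[t][u]: log-probability of reaching node (t, u)
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive via a blank emitted at (t-1, u)
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t - 1][u] + log_probs[t - 1][u][BLANK])
            if u > 0:  # arrive via label target[u-1] emitted at (t, u-1)
                alpha[t][u] = logaddexp(
                    alpha[t][u],
                    alpha[t][u - 1] + log_probs[t][u - 1][target[u - 1]])
    # every alignment ends with a blank from the last node
    return alpha[T - 1][U] + log_probs[T - 1][U][BLANK]

size_gb = joint_tensor_bytes(32, 500, 60, 4096) / 1e9

# Toy check: T = 2 frames, one target label, uniform over 3 tokens.
uniform = [[[math.log(1 / 3)] * 3 for _ in range(2)] for _ in range(2)]
ll = rnnt_log_likelihood(uniform, target=[1])
```

In the toy check there are exactly two alignments (label then blank, or blank then label, each followed by the final blank), and each has probability \((1/3)^3\), so the result equals \(\log(2/27)\). The double loop is the \(T \times U\) sweep the text describes; production implementations run it batched on the GPU.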
