Causal and Chunked Attention for Streaming

Streaming ASR requires attention mechanisms that never look at future audio; causal masking and chunked attention are the two principal techniques, each trading latency against accuracy in different ways.

A Transformer encoder trained on full utterances attends to every frame when computing the representation of any given frame. That is fine for batch transcription of a recorded lecture, but it is fatal for a real-time voice assistant: the model cannot produce a word until the entire sentence has arrived. The practical constraint is hard: voice interfaces with more than roughly 300 ms of end-to-end latency feel broken to users. Getting a full-attention Transformer under that ceiling requires restructuring the attention mechanism itself.

Why Standard Attention Cannot Stream

Standard scaled dot-product attention computes, for each query frame \(t\), a weighted sum over all key-value pairs in the sequence:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V\]

During inference on a stream, frame \(t+5\) has not arrived yet when you are computing the representation of frame \(t\). Using it anyway would be acausal: the system would be peeking into the future. Beyond causality, even if you waited, the quadratic \(O(N^2)\) cost over an ever-growing audio stream is prohibitive.
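To make the causality problem concrete, here is a minimal pure-Python sketch of scaled dot-product self-attention on a toy sequence. The helper names (`softmax`, `attention`) and the one-dimensional "frames" are illustrative assumptions, not taken from any particular library. It shows that changing only the last frame changes the output computed for the first frame.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

# Toy 1-D frames; self-attention uses the same vectors as Q, K and V.
frames = [[0.1], [0.4], [-0.2], [0.9]]
before = attention(frames, frames, frames)

# Replace only the LAST frame and recompute.
frames_changed = frames[:-1] + [[-3.0]]
after = attention(frames_changed, frames_changed, frames_changed)

# The representation of the FIRST frame moved, so it depended on the future.
first_frame_shift = abs(after[0][0] - before[0][0])
```

In a live stream that last frame does not exist yet when the first frame needs to be emitted, which is exactly why unmasked attention cannot be used as-is.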

There are two separate problems here: the causality problem (correctness) and the unbounded context problem (efficiency). The two main families of solutions address them differently.

Causal Masking: The Simplest Fix

The most direct solution applies a triangular mask to the attention matrix so that each query can only attend to past and present keys:

mask[i, j] = 0   if j <= i   (attend)
mask[i, j] = -inf if j > i   (block)

After the softmax, future positions receive zero weight. This is identical to the autoregressive masking used in decoder-side attention in sequence-to-sequence models, except here it is applied to the encoder.
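The mask rule above can be sketched in a few lines of plain Python. The function names `causal_mask` and `masked_softmax` are hypothetical helpers chosen for this illustration; the additive `-inf` convention follows the rule as written.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """Additive mask: 0 where key j <= query i, -inf where j > i."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Add the mask to the raw scores, then apply a stable softmax."""
    masked = [s + m for s, m in zip(scores, mask_row)]
    peak = max(masked)  # finite, because j == i is always allowed
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
raw_scores = [0.5, 1.0, -0.3, 2.0]   # scores of query frame 1 against keys 0..3
weights = masked_softmax(raw_scores, mask[1])
# Keys 2 and 3 lie in the future of query 1, so their weights are exactly 0.0,
# and the remaining weights still sum to 1.
```

Because `exp(-inf)` evaluates to zero, the blocked positions drop out of both the numerator and the normalising sum.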

Strictly causal attention is correct and streamable: as each new frame arrives you can compute its representation immediately. The cost is accuracy. A Conformer or Transformer encoder trained with full bidirectional context uses right-context frames to disambiguate phonemes that sound alike at their onset (for example, "s" vs "sh", where the distinction is partly carried by the following vowel). Forcing the encoder to be strictly causal removes that disambiguating information and typically raises word error rate (WER) by 5-15% relative to the same architecture trained with full context, depending on the language and model size.

A common compromise is limited right context: allow each query frame to attend \(L\) frames into the future (a small, fixed lookahead), then wait \(L\) frames before emitting the representation. This adds a deterministic latency of \(L \times \text{frame\_shift}\) per attention layer (often 40-80 ms with a 10 ms shift and \(L=4\)-8) but recovers much of the accuracy lost by strict causality. When several stacked layers each use lookahead, the effective right context, and therefore the wait, accumulates across layers. The Transformer Transducer paper by Zhang et al. (2020) demonstrated this trade-off systematically on LibriSpeech: a two-frame lookahead closed most of the accuracy gap with full-context attention at the cost of 20 ms of latency.
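A minimal sketch of the limited-right-context rule and its latency arithmetic follows. The names `lookahead_mask` and `lookahead_latency_ms` are illustrative, and the 10 ms default frame shift is the figure assumed in the text above.

```python
def lookahead_mask(n, lookahead):
    """True where query i may attend key j: all past frames plus
    at most `lookahead` future frames."""
    return [[j <= i + lookahead for j in range(n)] for i in range(n)]

def lookahead_latency_ms(lookahead, frame_shift_ms=10.0):
    """Deterministic wait added by one layer's lookahead: L * frame_shift."""
    return lookahead * frame_shift_ms

mask = lookahead_mask(6, lookahead=2)
# Query 0 sees keys 0, 1 and 2; keys 3..5 stay blocked.
row0 = mask[0]

low = lookahead_latency_ms(4)    # 40.0 ms
high = lookahead_latency_ms(8)   # 80.0 ms
```

Setting `lookahead=0` reduces this to the strictly causal mask, so the two schemes sit on one continuum controlled by a single integer.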

Chunked Attention: Context Windows Without Full History

Causal masking still faces the growing-context problem: after 10 seconds of speech at a 10 ms frame shift, each new frame attends over up to 1000 past frames, and the KV cache grows without bound. Chunked attention solves this by dividing the audio stream into non-overlapping (or slightly overlapping) segments, called chunks, and restricting attention to within-chunk frames plus a fixed number of recent past frames.

The canonical pattern looks like this:

Chunk boundaries: [0..C-1], [C..2C-1], [2C..3C-1], ...

For chunk k (frames kC to (k+1)C - 1):
  - Attend to all frames within the current chunk (full softmax)
  - Attend to the last M frames of the previous chunk (fixed memory)
  - Block all other past and all future frames

This gives \(O(C^2 + C \cdot M)\) attention per chunk rather than \(O(N^2)\) over the full stream, with \(C\) and \(M\) being hyperparameters chosen based on latency and accuracy targets. A typical configuration might use \(C = 40\) frames (400 ms) and \(M = 16\) frames (160 ms of left-context carryover).
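The chunk pattern and its cost can be sketched as follows. This is a minimal illustration with hypothetical helper names (`chunked_mask`, `scores_per_chunk`); it uses non-overlapping chunks and counts query-key score evaluations as a proxy for attention cost.

```python
def chunked_mask(n, chunk, memory):
    """True where query i may attend key j: the whole current chunk plus
    the last `memory` frames before the chunk starts."""
    mask = []
    for i in range(n):
        start = (i // chunk) * chunk        # first frame of i's chunk
        end = min(start + chunk, n)         # one past the last frame
        left = max(0, start - memory)       # carried-over left context
        mask.append([left <= j < end for j in range(n)])
    return mask

def scores_per_chunk(chunk, memory):
    """Query-key scores per chunk: C * (C + M) = C^2 + C*M."""
    return chunk * (chunk + memory)

mask = chunked_mask(8, chunk=4, memory=2)
row5 = mask[5]   # frame 5 is in chunk [4..7] and also sees frames 2 and 3

# The configuration from the text: C = 40, M = 16, over 1000 frames (10 s).
per_chunk = scores_per_chunk(40, 16)          # 2240
chunked_total = (1000 // 40) * per_chunk      # 56000, an upper bound
full_total = 1000 * 1000                      # 1000000 for full attention
```

Note that frames early in a chunk can see later frames of the same chunk, which is where the buffering latency comes from.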

The latency floor of chunked attention is set by the chunk duration: you must buffer \(C\) frames before processing them, so the first frame of each chunk waits almost a full chunk before its representation can be emitted, and frames wait about half a chunk on average, before any compute time is added. Smaller chunks reduce latency but also reduce the context available during the softmax, harming accuracy. Larger chunks improve accuracy but increase latency. In practice, chunk sizes of 320-640 ms are common for production streaming ASR.
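The buffering cost can be worked out per frame. The sketch below assumes a 10 ms frame shift and ignores compute time and any lookahead; `buffering_wait_ms` is a hypothetical helper for illustration.

```python
def buffering_wait_ms(frame_index, chunk, frame_shift_ms=10.0):
    """Time a frame waits for the remaining frames of its chunk to arrive."""
    position = frame_index % chunk
    return (chunk - 1 - position) * frame_shift_ms

chunk = 40  # 400 ms at a 10 ms frame shift
waits = [buffering_wait_ms(i, chunk) for i in range(chunk)]

worst_case = waits[0]            # first frame of the chunk waits longest
best_case = waits[-1]            # last frame triggers processing immediately
average = sum(waits) / len(waits)
```

For this configuration the worst case is 390 ms and the average is 195 ms, which is why halving the chunk size roughly halves the buffering latency.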

Emformer (Shi et al., 2021) is a notable variant that compresses the left context into a fixed-size augmented memory bank using a summary vector, rather than storing raw frames. This keeps the KV cache bounded at inference time while allowing the model to access a longer effective left context during training. Emformer achieved competitive LibriSpeech results at around 960 ms average latency with a 4.6x training speedup over the augmented memory Transformer (AM-TRF) baseline it builds on.
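The bounded-memory idea can be illustrated with a toy sketch. This is deliberately not Emformer's actual computation: the real model produces its memory vectors through attention, whereas the assumption here is a plain mean-pool per chunk, and the `MemoryBank` class is hypothetical. The point is only that the number of stored slots stays fixed however long the stream runs.

```python
from collections import deque

def mean_pool(frames):
    """Summarise a chunk as the element-wise mean of its frame vectors
    (a stand-in for a learned summary vector)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

class MemoryBank:
    """Fixed-capacity store of per-chunk summaries; oldest entries drop out."""

    def __init__(self, max_slots):
        self.slots = deque(maxlen=max_slots)

    def add_chunk(self, frames):
        self.slots.append(mean_pool(frames))

    def keys(self):
        return list(self.slots)

bank = MemoryBank(max_slots=3)
for k in range(10):
    # Ten toy chunks of four identical 2-D frames each.
    chunk = [[float(k), float(k) + 1.0] for _ in range(4)]
    bank.add_chunk(chunk)

stored = bank.keys()   # only the summaries of chunks 7, 8 and 9 remain
```

A later chunk would attend over `stored` in addition to its own frames, so memory cost is set by `max_slots` rather than by stream length.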

Monotonic Chunkwise Attention (MoChA), introduced by Chiu and Raffel (2018), approaches the problem differently: instead of hard chunk boundaries, it learns a soft monotonic selection process that decides when to move forward in the input. The model emits an output token after attending softly over a small window ending at the current selection point. This is closer in spirit to the monotonic, left-to-right alignment of RNN-T, and it decouples the output step rate from fixed chunk sizes. At test time, MoChA decodes in linear time, avoiding the quadratic cost of standard soft attention.
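One test-time decoding step of this scheme can be sketched as below. It is a simplified illustration: in the real model both sets of energies are recomputed from the decoder state at every output step, whereas here they are assumed to be supplied as plain lists, and `mocha_step` is a hypothetical name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mocha_step(select_energies, chunk_energies, start, window):
    """Scan forward from `start`; stop at the first frame whose selection
    probability reaches 0.5, then softmax over the `window` frames ending
    there. Returns (selected index, {frame index: weight}), or (None, {})
    if no frame is selected."""
    for j in range(start, len(select_energies)):
        if sigmoid(select_energies[j]) >= 0.5:
            lo = max(0, j - window + 1)
            scores = chunk_energies[lo:j + 1]
            peak = max(scores)
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            return j, {lo + k: e / total for k, e in enumerate(exps)}
    return None, {}

select_energies = [-2.0, -1.0, 0.5, 3.0]
chunk_energies = [0.0, 0.0, 0.0, 0.0]
selected, weights = mocha_step(select_energies, chunk_energies,
                               start=0, window=2)
# Frame 2 is the first with selection probability >= 0.5, and the soft
# window covers frames 1 and 2 with equal weight.
```

The next output step would resume scanning from `selected`, so the pointer never moves backwards and the total scanning work over an utterance stays linear in its length.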

Training vs. Inference Alignment

A subtle but important issue: a model intended for streaming inference must be trained with the same attention mask it will use at inference. Training with full bidirectional attention and then applying a causal mask at inference produces a distribution mismatch: the model has never learned to operate without right context, so its representations in "streaming mode" are out-of-distribution.

This means you cannot simply take a pretrained full-context Whisper model (Radford et al., 2022) and stream it naively by masking future frames. Whisper was trained to encode 30-second segments with full attention. Practical streaming deployments of Whisper instead chunk the audio at segment boundaries and batch-encode each completed segment, accepting the latency cost. True streaming Whisper requires retraining with a causal or chunk-based attention scheme.

A practical training recipe for chunked streaming models:

  1. Use chunk-simulated masking during training: randomly sample chunk boundaries and apply the chunk-restricted mask to every mini-batch.
  2. Include a small proportion of full-context examples (or a warm-start from a full-context model) to stabilise early training.
  3. Validate on both streaming and batch metrics; they can diverge significantly.
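Steps 1 and 2 of the recipe can be sketched as a per-batch mask sampler. This is one possible reading, with assumptions stated plainly: chunk boundaries are randomised by sampling a chunk size from a small set, and full-context examples are mixed in with a fixed probability. The function names and all parameter values are illustrative.

```python
import random

def chunk_mask(n, chunk, memory):
    """Chunk-restricted mask: current chunk plus `memory` left-context frames."""
    mask = []
    for i in range(n):
        start = (i // chunk) * chunk
        end = min(start + chunk, n)
        left = max(0, start - memory)
        mask.append([left <= j < end for j in range(n)])
    return mask

def sample_training_mask(n, chunk_choices, memory, full_context_prob, rng):
    """With probability `full_context_prob` return an all-True mask,
    otherwise a chunk mask with a randomly chosen chunk size."""
    if rng.random() < full_context_prob:
        return [[True] * n for _ in range(n)]
    return chunk_mask(n, rng.choice(chunk_choices), memory)

rng = random.Random(0)  # seeded only to make the sketch reproducible
batch_masks = [
    sample_training_mask(8, chunk_choices=[2, 4], memory=2,
                         full_context_prob=0.1, rng=rng)
    for _ in range(5)
]
```

Each mask would be applied to the attention scores of one mini-batch, so the encoder sees a mixture of streaming-like and full-context conditions during training.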

Transformer Transducer architectures (Zhang et al., 2020) integrate this naturally because the RNN-T loss is frame-synchronous by construction, providing a strong gradient signal for the encoder to produce useful representations even with limited right context.

When It Falls Down

Chunk-boundary artefacts. Phonemes and words that straddle a chunk boundary get representations computed from incomplete local context. If the chunk boundary happens mid-phoneme, the attention window on either side may be too narrow to correctly classify it. Using a small overlap between chunks (sending the last \(O\) frames of chunk \(k\) as prefix context for chunk \(k+1\)) mitigates this at the cost of redundant computation.
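The overlap mitigation amounts to simple index bookkeeping, sketched here with a hypothetical helper `overlapping_chunks`. Each chunk is processed together with the last `overlap` frames before it, and those prefix frames are the redundant computation mentioned above.

```python
def overlapping_chunks(n, chunk, overlap):
    """Return (context_start, chunk_start, chunk_end) per chunk. Frames in
    [context_start, chunk_start) are re-processed as prefix context."""
    spans = []
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        spans.append((max(0, start - overlap), start, end))
    return spans

spans = overlapping_chunks(10, chunk=4, overlap=2)
# Three chunks; the second and third each re-process two earlier frames.

redundant_frames = sum(start - ctx for ctx, start, _ in spans)
```

Outputs are emitted only for frames in `[chunk_start, chunk_end)`, so the overlap changes what each chunk can see without duplicating any emitted frame.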

Long-range dependencies. Some ASR phenomena require long context, for example anaphora in dictation, speaker diarisation, or punctuation recovery. A hard chunk boundary with a small memory carryover will drop this context. Hierarchical models (a streaming encoder feeding a non-causal re-scoring pass) can help but add latency.

Memory bank quality in Emformer-style models. The compressed memory summary is trained end-to-end, but its capacity is fixed. If the compressed representation discards phonetically relevant features, the accuracy loss relative to raw-frame left context can be non-trivial and hard to diagnose.

Training instability with small chunks. With very small chunks (\(C < 8\) frames), the attention windows are so narrow that training gradients become noisy. This often manifests as unusually slow convergence or sensitivity to learning-rate warmup schedule length.

Latency-accuracy Pareto curve. There is no free lunch: every reduction in chunk size (lower latency) costs WER, and the shape of the Pareto curve depends heavily on the language, speaking rate, and acoustic environment. A configuration tuned for English read speech at 200 ms latency may degrade much more sharply on fast conversational Mandarin.
