Streaming vs Offline ASR

Every time you dictate a voice message and watch words appear before you finish speaking, you are watching a streaming ASR system make predictions without knowing what comes next. Every time a captioning service processes a recorded lecture overnight and achieves near-human accuracy, you are seeing the payoff of having the full audio in hand. The choice between the two modes is not a matter of preference; it is a fundamental architectural constraint that propagates through every layer of the system, from the encoder's receptive field to the decoder's search strategy.

The Core Asymmetry: Causal vs. Non-Causal Context

The single most important difference is access to future frames.

A non-causal (offline) encoder - a standard Transformer or Conformer - computes attention over the entire input sequence. The representation of frame t is informed by frames at t+100 just as easily as by frames at t-100. That bidirectional context is enormously useful: the same sounds can call for "read" or "red", or for "two" or "too", depending on the words that follow, and a full-utterance encoder resolves such ambiguities cleanly.

A causal (streaming) encoder may only attend to frames at positions ≤ t plus a bounded look-ahead window. Formally, if the encoder produces hidden states h_1:T, the streaming constraint is:

h_t = f(x_{t-L : t+R})   where R << T

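L is the left context (effectively unlimited in most designs) and R is the right context, or look-ahead. Larger R reduces word error rate but increases latency. A typical production choice is R = 4-8 frames, which is 40-80 ms at a 10 ms frame shift and proportionally more when the encoder subsamples its input, since each encoder frame then spans several input frames.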
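The constraint above is usually enforced with an attention mask. The following is a minimal sketch in plain Python, not any particular toolkit's implementation; the function name `streaming_mask` and its parameters are illustrative assumptions, with `left=None` standing in for unlimited left context.

```python
def streaming_mask(num_frames, left, right):
    """Return mask[t][s] == True where frame t may attend to frame s.

    left:  number of past frames visible (None = unlimited left context L)
    right: number of future frames visible (the look-ahead R)
    """
    mask = []
    for t in range(num_frames):
        lo = 0 if left is None else max(0, t - left)
        hi = min(num_frames - 1, t + right)
        mask.append([lo <= s <= hi for s in range(num_frames)])
    return mask


# Hypothetical example: 6 frames, unlimited left context, one frame of look-ahead.
for row in streaming_mask(num_frames=6, left=None, right=1):
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query frame; an `x` marks a frame it may attend to. Setting `right` to `num_frames` recovers the fully non-causal mask, which makes the offline encoder a limiting case of the streaming one.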

The consequence is not subtle. On LibriSpeech test-clean, a non-causal Conformer-Transducer reaches around 2% WER; its causal counterpart with a 640 ms latency budget typically sits 10-20% relative worse, i.e., roughly 2.2-2.4% WER. The gap widens on noisy or spontaneous speech.

Architectures That Actually Work in Each Mode

Offline

Whisper (Radford et al., 2022) is the canonical example. Its encoder is a non-causal Transformer operating over mel-spectrograms padded to a fixed 30-second window. The decoder is autoregressive but, crucially, the encoder has seen every frame before the decoder runs. Whisper is trained on 680,000 hours of weakly-labelled web audio and achieves competitive zero-shot WER on many benchmarks. It is not streamable in its published form; the 30-second fixed context means you must buffer audio before transcription begins.

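Attention encoder-decoder (AED) models with a full Conformer encoder share the same offline nature. Because the decoder cross-attends to all encoder states, any streaming variant has to restrict that attention (for example with chunk-wise or monotonic attention masks) or fall back to rescoring the output of a separate streaming first pass.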

Streaming

RNN-T (Graves, 2012) was designed from the start as a streaming architecture. The prediction network (a recurrent language model) and the encoder run independently; the joiner network combines them frame by frame. The key property is that the transducer can emit a blank token at every frame and delay emitting real tokens until it has enough evidence. RNN-T dominates production on-device ASR: Google's Pixel voice typing and Android's Speech Recognizer have both been built around RNN-T variants.
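The blank-or-emit behaviour can be illustrated with a greedy decoding loop. This is a toy sketch under stated assumptions: `toy_predictor_step` and `toy_joiner` are hypothetical stand-ins for trained networks, the encoder "frames" are just integers naming the token the acoustics support (0 for silence), and the cap on symbols per frame is a common safeguard rather than part of the original formulation.

```python
BLANK = 0


def greedy_transducer_decode(encoder_frames, predictor_step, joiner,
                             max_symbols_per_frame=3):
    """Greedy transducer decoding over encoder frames consumed one at a time.

    predictor_step(token, state) -> (pred_out, new_state)
    joiner(enc_out, pred_out)    -> list of scores, index 0 being blank
    """
    tokens = []
    pred_out, state = predictor_step(BLANK, None)  # start-of-sequence step
    for enc_out in encoder_frames:
        for _ in range(max_symbols_per_frame):
            scores = joiner(enc_out, pred_out)
            best = max(range(len(scores)), key=scores.__getitem__)
            if best == BLANK:
                break  # blank: emit nothing, move on to the next frame
            tokens.append(best)
            # The prediction network advances only when a real token is emitted.
            pred_out, state = predictor_step(best, state)
    return tokens


def toy_predictor_step(token, state):
    # Toy "prediction network": remembers only the last emitted token.
    return token, token


def toy_joiner(enc_out, pred_out):
    # Toy joiner: emit the acoustically supported token once, then go blank.
    scores = [0.0] * 4
    target = BLANK if enc_out == pred_out else enc_out
    scores[target] = 1.0
    return scores


frames = [0, 0, 2, 2, 0, 3, 1, 0]  # 0 = silence
print(greedy_transducer_decode(frames, toy_predictor_step, toy_joiner))
```

Because `encoder_frames` is consumed lazily, the same loop works on a generator that yields frames as audio arrives. The toy joiner cannot emit the same token twice in a row, a limitation of the illustration and not of transducers.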

Streaming Conformer-Transducer replaces the RNN encoder with a Conformer using chunk-wise attention. Chunks of 160-640 ms are processed with left context cached from prior chunks. The Transformer-XL-style cache means the model retains long-range context across chunks without recomputing earlier frames.
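The chunk-plus-cache pattern can be sketched as a generator. This is a simplified illustration, assuming a hypothetical `stream_chunks` helper that caches raw input frames; real chunk-wise encoders cache per-layer activations (such as attention keys and values) rather than inputs, but the bookkeeping has the same shape.

```python
from collections import deque


def stream_chunks(frames, chunk_size, left_context):
    """Yield (cached_left, chunk) pairs as a chunk-wise encoder would see them."""
    cache = deque(maxlen=left_context)  # bounded store of already-seen frames
    chunk = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield list(cache), chunk
            cache.extend(chunk)  # oldest frames fall out once the cache is full
            chunk = []
    if chunk:  # final partial chunk at end of stream
        yield list(cache), chunk


# Hypothetical sizes: 4-frame chunks with up to 6 frames of cached left context.
for left, chunk in stream_chunks(range(10), chunk_size=4, left_context=6):
    print(left, chunk)
```

Nothing already in the cache is recomputed when a new chunk arrives, which is what keeps the per-chunk cost constant however long the stream runs.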

A practical comparison:

Property	Offline (Whisper-style)	Streaming (RNN-T-style)
Encoder context	Full utterance	Causal + bounded look-ahead
First-word latency	Seconds (waits for the buffer to fill)	Tens to hundreds of ms
WER on LibriSpeech clean	~2-3% (large models)	~2.4-4% (comparable scale)
On-device viability	Hard (large buffer, high RAM)	Yes; used in production
Language model integration	N-gram rescoring or LM fusion	Shallow fusion, internal LM estimation (ILME)
Typical use case	Transcription, subtitling, translation	Voice commands, dictation, live captions

Latency Accounting

Latency in streaming ASR has three additive components:

Algorithmic latency: the right-context look-ahead R frames plus any chunk boundary delay.
Acoustic model latency: the time for the model to process the current chunk on the target hardware.
Decoder latency: beam search or greedy decoding over the transducer lattice.

The perceived latency - what the user notices - is the delay from speaking a word to seeing it appear on screen. Targets for voice assistants are typically under 300 ms end-to-end; captioning systems tolerate 1-2 seconds for a smoother visual experience.
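The three components can be added up in a small budget calculator. The class below is a hypothetical sketch: the field names are assumptions, the numbers in the example are invented for illustration rather than measured, and the chunk term assumes the worst case of waiting for a full chunk.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    lookahead_frames: int  # R, counted in encoder frames
    frame_ms: float        # duration of one encoder frame
    chunk_ms: float        # chunk length; worst case waits a full chunk
    model_ms: float        # acoustic model compute time per chunk
    decoder_ms: float      # search time per chunk

    def algorithmic_ms(self):
        # Look-ahead plus chunk boundary delay.
        return self.lookahead_frames * self.frame_ms + self.chunk_ms

    def total_ms(self):
        return self.algorithmic_ms() + self.model_ms + self.decoder_ms


# Illustrative numbers only, checked against a 300 ms voice-assistant target.
budget = LatencyBudget(lookahead_frames=4, frame_ms=10.0, chunk_ms=160.0,
                       model_ms=40.0, decoder_ms=15.0)
print(budget.algorithmic_ms(), budget.total_ms(), budget.total_ms() <= 300.0)
```

The algorithmic term is fixed by the model design, while the other two depend on the target hardware and have to be measured there.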

Offline models add a fourth component: buffering latency, the time spent waiting for enough audio. Whisper's 30-second window means a user sees no transcript for up to 30 seconds, which is obviously unsuitable for live dictation. Partial-decoding hacks exist (run Whisper on sliding windows) but they introduce inconsistencies and hallucinations at window boundaries.
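To see where the boundary problems come from, it helps to write out the window schedule such a hack uses. The helper below is a hypothetical sketch; the 30-second window matches the text, while the 25-second hop (5 seconds of overlap) is an arbitrary assumption.

```python
def sliding_windows(total_s, window_s=30.0, hop_s=25.0):
    """Return (start, end) times in seconds covering [0, total_s] with overlap."""
    if hop_s <= 0 or hop_s > window_s:
        raise ValueError("hop_s must be in (0, window_s]")
    windows = []
    start = 0.0
    while True:
        end = min(start + window_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += hop_s
    return windows


print(sliding_windows(70.0))
```

Every overlapping region is transcribed twice, once with little right context and once with little left context. The two transcripts need not agree, and reconciling them is where the inconsistencies appear.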

Decoding Differences

Offline systems can run full beam search with a language model over the entire decoded sequence, applying techniques like length normalisation and coverage penalties that presuppose knowing when the utterance ends.
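Length normalisation is easy to illustrate. The sketch below uses one simple form, dividing the total log-probability by the token count raised to a power `alpha`; other forms exist, and the function name and example scores are made up for illustration.

```python
def rerank(hypotheses, alpha=1.0):
    """Sort (tokens, total_log_prob) pairs best-first by length-normalised score.

    alpha=0 disables normalisation; alpha=1 scores by mean log-prob per token.
    """
    def score(hyp):
        tokens, log_prob = hyp
        return log_prob / (max(len(tokens), 1) ** alpha)

    return sorted(hypotheses, key=score, reverse=True)


hyps = [
    (["the", "cat"], -2.0),                 # short, higher total log-prob
    (["the", "cat", "sat", "down"], -3.0),  # longer, lower total log-prob
]
print(rerank(hyps, alpha=0.0)[0][0])  # raw score favours the short hypothesis
print(rerank(hyps, alpha=1.0)[0][0])  # normalised score favours the long one
```

The comparison is only meaningful between complete hypotheses, which is why it fits naturally into offline decoding and less comfortably into a streaming decoder that has not yet seen the end of the utterance.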

Streaming decoders must commit to tokens as they go. RNN-T uses a modified beam search that maintains a beam of partial hypotheses and prunes aggressively. A common problem is emission delay: the model may hold off emitting a token for many frames waiting for supporting evidence, causing words to "pop in" together rather than appear smoothly. Techniques like FastEmit (Yu et al., 2021) add a regularisation term that rewards earlier emission.

Another complication: endpointing. An offline system is handed a pre-segmented audio file. A streaming system must decide when the speaker has stopped. Endpointing errors (cutting too early or too late) directly damage WER and perceived latency. Most production systems train a lightweight classifier that triggers on silence plus acoustic features.
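The silence-triggered part of an endpointer can be sketched with a trailing-silence counter. This is a deliberately crude stand-in that uses only a per-frame energy threshold; the trained classifiers mentioned above use richer features, and the function name, threshold, and energies here are illustrative assumptions.

```python
def find_endpoint(frame_energies, silence_threshold, min_trailing_silence_frames):
    """Return the frame index at which the endpoint fires, or None if it never does."""
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            heard_speech = True
            silent_run = 0  # any speech frame resets the silence counter
        elif heard_speech:
            # Count silence only after speech, so leading silence never triggers.
            silent_run += 1
            if silent_run >= min_trailing_silence_frames:
                return i
    return None


energies = [0.0, 0.1, 0.9, 0.8, 0.1, 0.7, 0.1, 0.0, 0.1, 0.9]
print(find_endpoint(energies, silence_threshold=0.5, min_trailing_silence_frames=3))
print(find_endpoint(energies, silence_threshold=0.5, min_trailing_silence_frames=5))
```

The `min_trailing_silence_frames` parameter exposes the trade-off directly: a small value cuts speakers off mid-pause, while a large value adds its full duration to the perceived latency at the end of every utterance.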

When It Falls Down

Streaming systems struggle with long-range dependencies. The bounded right-context means phenomena like cross-sentence co-reference, long compound words in German, or sentences where the verb comes last (SOV languages such as Japanese, Korean, Turkish) all suffer. Chunk-wise models mitigate this with cached left context but the mitigation is partial.

Offline systems are unusable for live speech. Any application requiring sub-second response - voice assistants, live captions, phone transcription - cannot wait for a 30-second buffer. Whisper cannot be used out-of-the-box for these cases.

Streaming models are brittle at chunk boundaries. If a word straddles two chunks, the model may split or mangle it. The choice of chunk size involves a trade-off: small chunks minimise latency but increase boundary errors; large chunks reduce boundary errors but reintroduce buffering latency.

Offline rescoring cannot fix streaming commits. If a streaming model emits the wrong word early and the user says a disambiguating word 500 ms later, the system has already displayed an error. Some systems add a "correction" mechanism (re-decoding with a shifted window), but this creates a flickering display that users find disconcerting.

Both modes fail on domain mismatch. A model trained on clean read speech will have elevated WER on accented, spontaneous, or noisy speech regardless of streaming mode. This is an acoustic modelling problem, not a streaming problem, but it is the most common production complaint.

Whisper hallucinates on silence. Its decoder is an autoregressive text generator conditioned on a padded 30-second window, so running it on near-silence can produce fluent, confident nonsense. Streaming transducers, by contrast, can simply emit blanks on silence and tend to be more robust in this case.