Streaming TTS and Latency
Streaming TTS pipelines generate and deliver audio incrementally to cut time-to-first-audio from several seconds to under 300 ms, but doing so imposes hard trade-offs on chunk size, model architecture, and prosodic coherence.
intermediate · 7 min read
A voice assistant that waits until the entire response has been synthesised before playing a single sample is unusable. In practice, users tolerate roughly 200-300 ms of silence before perceiving a system as broken. For a modern neural TTS pipeline generating a 10-second utterance, batch synthesis can take 1-3 seconds on a GPU and far longer on CPU. Streaming is not optional polish; it is the difference between a product and a prototype.
What "streaming" actually means in TTS
Streaming TTS means the system begins producing and delivering audio before the full utterance has been synthesised, and often before the full text is even available. This happens at two separate stages that are often conflated:
Text-side streaming. In a conversational assistant, the LLM generating the reply is itself streaming tokens. The TTS engine must decide how to segment that incomplete token stream into synthesis units. Splitting on sentence boundaries is the simplest heuristic: the synthesiser begins on the first sentence while the LLM is still producing the second. More aggressive policies split on clause boundaries or even short phrases. The trade-off is prosody: a sentence spoken without knowing what comes next may carry the wrong intonation.
Audio-side streaming. Even when the full text is known, the neural model can be structured to emit audio in chunks rather than in a single pass. A vocoder operating on mel-spectrogram frames naturally produces audio at a fixed frame rate (typically 12.5 ms per frame at 80 frames/second). If the acoustic model can emit spectrogram frames incrementally, the vocoder can start decoding them while the rest of the sequence is still being predicted.
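The text-side policy described above can be sketched as a small generator. This is a minimal illustration, not a production segmenter: the boundary rules, the `min_words` threshold, and the name `segment_stream` are all assumptions, and the naive punctuation test will mis-split abbreviations and decimal numbers.

```python
import re
from typing import Iterable, Iterator

# Hypothetical boundary rules: sentence-final punctuation always closes a unit;
# clause punctuation closes one only once the unit has at least `min_words` words.
SENTENCE_END = re.compile(r"[.!?]$")
CLAUSE_END = re.compile(r"[,;:]$")


def segment_stream(tokens: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Yield synthesis units from an incrementally arriving LLM token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        if not text:
            continue
        n_words = len(text.split())
        at_sentence_end = SENTENCE_END.search(text) is not None
        at_clause_end = CLAUSE_END.search(text) is not None and n_words >= min_words
        if at_sentence_end or at_clause_end:
            yield text  # hand this unit to the synthesiser immediately
            buffer = ""
    # Flush whatever is left when the LLM stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Lowering `min_words` makes the first unit available sooner, at the cost of the prosody risk described above, because the synthesiser then sees less of the sentence.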
Latency anatomy: the three numbers that matter
Three numbers characterise streaming TTS performance:
| Metric | Definition | Typical target |
|---|---|---|
| Time-to-first-audio (TTFA) | Wall-clock from start of synthesis to first decoded sample | < 250 ms |
| Real-time factor (RTF) | Synthesis time / audio duration | < 0.5 |
| Buffering latency | Audio delivered ahead of playback cursor | 80-200 ms |
RTF below 1.0 means the system produces audio faster than it is consumed; the playback buffer stays full. RTF above 1.0 is catastrophic for streaming because the buffer empties and the listener hears dropouts. TTFA is the metric users actually perceive; RTF governs whether TTFA stays low across the whole utterance.
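To see why RTF above 1.0 leads to dropouts, the following toy simulation counts playback underruns for a fixed chunk size. It is a sketch under simplifying assumptions (chunks are synthesised strictly one after another, every chunk costs `chunk_s * rtf` seconds, and delivery is instantaneous); `count_underruns` and its parameters are illustrative names.

```python
def count_underruns(rtf: float, chunk_s: float, n_chunks: int, prebuffer_chunks: int = 1) -> int:
    """Count how often playback stalls waiting for the next chunk."""
    # Wall-clock time at which chunk i has finished synthesis.
    ready = [(i + 1) * chunk_s * rtf for i in range(n_chunks)]
    # Playback starts as soon as the prebuffer has been filled.
    clock = ready[prebuffer_chunks - 1]
    underruns = 0
    for i in range(n_chunks):
        if ready[i] > clock:
            # The chunk is not there yet: the listener hears a gap.
            underruns += 1
            clock = ready[i]
        clock += chunk_s  # play the chunk
    return underruns


fast = count_underruns(rtf=0.5, chunk_s=0.2, n_chunks=50)
slow = count_underruns(rtf=1.5, chunk_s=0.2, n_chunks=50)
```

With these inputs the RTF 0.5 run never stalls, while the RTF 1.5 run stalls on every chunk after the prebuffered one, because each chunk takes longer to produce than to play.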
A clean way to decompose TTFA is:
TTFA = T_chunk_decision + T_first_chunk_synthesis + T_network_buffer
Each term is independently optimisable. T_chunk_decision is minimised by triggering on short phrases rather than full sentences. T_first_chunk_synthesis depends on model architecture. T_network_buffer is an audio-player constant, typically 80-150 ms.
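As a worked example of the sum above, the snippet below compares a sentence-level trigger with a phrase-level trigger. All millisecond values are hypothetical, chosen only to illustrate which term dominates; the 250 ms target comes from the metrics table.

```python
from dataclasses import dataclass

TTFA_TARGET_MS = 250.0  # target from the metrics table


@dataclass
class TtfaBudget:
    chunk_decision_ms: float
    first_chunk_synthesis_ms: float
    network_buffer_ms: float

    def total_ms(self) -> float:
        return (
            self.chunk_decision_ms
            + self.first_chunk_synthesis_ms
            + self.network_buffer_ms
        )

    def dominant_term(self) -> str:
        terms = vars(self)  # field name -> value
        return max(terms, key=terms.get)


# Hypothetical numbers: only the chunk-decision term differs between the two.
sentence_trigger = TtfaBudget(600.0, 90.0, 100.0)
phrase_trigger = TtfaBudget(40.0, 90.0, 100.0)

for name, budget in [("sentence", sentence_trigger), ("phrase", phrase_trigger)]:
    verdict = "within" if budget.total_ms() < TTFA_TARGET_MS else "over"
    print(f"{name}: {budget.total_ms():.0f} ms ({verdict} target), "
          f"dominated by {budget.dominant_term()}")
```

Under these assumed values, waiting for a full sentence blows the budget on the chunk-decision term alone, whereas the phrase trigger leaves the player buffer as the largest remaining cost.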
Architecture choices that control streaming behaviour
Autoregressive models and the codec-token bottleneck
VALL-E and its successors treat TTS as language modelling over discrete codec tokens (see concept C14 on VALL-E). The EnCodec codec used in that work is itself described as a streaming encoder-decoder architecture and operates in real time. However, the codec language model on top is autoregressive: each token conditions on all previous tokens. For a 10-second utterance at 75 tokens/second that is 750 sequential generation steps before even one frame of audio can be handed to the decoder, unless the system streams at the token level.
Streaming from an autoregressive codec-LM requires:
- Committing to a chunk size in codec-token frames (e.g. 25 tokens = 333 ms of audio).
- Running the vocoder decoder on each completed chunk immediately.
- Accepting that the model has no look-ahead; prosodic decisions are made locally.
BASE TTS (Amazon, 2024) explicitly addresses this by using a convolution-based decoder that converts discrete speech codes into waveforms in an "incremental, streamable manner", decoupling waveform synthesis latency from codec-token generation.
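The token-level streaming loop listed above can be sketched as follows. This is a schematic, not the VALL-E or BASE TTS implementation: `next_token` and `decode_chunk` are hypothetical callables standing in for the codec language model and the waveform decoder, and a real decoder would also need to carry state across chunk boundaries.

```python
from typing import Callable, Iterator, List

TOKENS_PER_SECOND = 75  # codec frame rate used in the example above
CHUNK_TOKENS = 25       # 25 / 75 s, i.e. about 333 ms of audio per chunk
EOS = -1                # hypothetical end-of-sequence token id


def stream_audio(
    next_token: Callable[[List[int]], int],
    decode_chunk: Callable[[List[int]], bytes],
    max_tokens: int,
) -> Iterator[bytes]:
    """Generate tokens autoregressively and emit audio one chunk at a time."""
    history: List[int] = []  # everything generated so far (the LM's context)
    pending: List[int] = []  # tokens not yet handed to the decoder
    for _ in range(max_tokens):
        token = next_token(history)
        if token == EOS:
            break
        history.append(token)
        pending.append(token)
        if len(pending) == CHUNK_TOKENS:
            yield decode_chunk(pending)  # decode as soon as a chunk completes
            pending = []
    if pending:
        yield decode_chunk(pending)  # final, possibly shorter, chunk


# Stand-ins so the sketch runs end to end: a fake LM that stops after 60 tokens
# and a fake decoder that returns one byte per token.
def fake_next_token(history: List[int]) -> int:
    return len(history) if len(history) < 60 else EOS


def fake_decode(tokens: List[int]) -> bytes:
    return bytes(len(tokens))


chunk_sizes = [len(c) for c in stream_audio(fake_next_token, fake_decode, max_tokens=1000)]
```

The first chunk becomes available after 25 generation steps rather than after the whole sequence, which is where the TTFA saving comes from.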
Non-autoregressive models: fast in total, harder to stream
FastSpeech-style non-autoregressive models (C06) generate the full spectrogram in one parallel pass, which gives very low total synthesis time but means no partial output is available until the whole utterance is done. For short utterances the total latency is low enough that streaming is unnecessary. For long-form generation, these models struggle because the parallel computation still scales with utterance length and the model must predict all durations before any frame is emitted.
A practical escape: segment text into short phrases first, then run a fast non-autoregressive model on each phrase. Latency of the first phrase is the TTFA. This is the dominant pattern in production systems as of 2024.
Chunk size and prosodic discontinuities
Chunk size governs both latency and audio quality. Smaller chunks reduce TTFA but increase the risk of unnatural phrasing. The model has less context when deciding the fundamental frequency (F0) contour of each chunk. Audible artefacts include:
- Rising intonation at chunk boundaries where a declarative sentence should fall.
- Hesitation-like micro-pauses at chunk seams if overlap between chunks is not handled.
- Inconsistent speaking rate when durations are predicted per-chunk without a global plan.
A simple mitigation is to decode chunks with overlap (e.g. 20% lookahead): the model sees slightly more context than it will emit, then the overlapping frames are discarded. This adds a fixed latency equal to the overlap duration but substantially smooths boundary artefacts.
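The overlap scheme can be expressed as a frame-index plan: each chunk is synthesised up to `context_end` but only frames up to `emit_end` are kept. The function name `plan_chunks` and the tuple layout are illustrative choices; the 20% default mirrors the example figure in the text.

```python
from typing import List, Tuple


def plan_chunks(
    n_frames: int, chunk: int, lookahead_ratio: float = 0.2
) -> List[Tuple[int, int, int]]:
    """Return (start, emit_end, context_end) frame indices for each chunk.

    Frames in [start, emit_end) are emitted. Frames in [emit_end, context_end)
    are synthesised only to give the model right-hand context; they are
    discarded and produced again as the head of the next chunk.
    """
    lookahead = int(round(chunk * lookahead_ratio))
    plan = []
    start = 0
    while start < n_frames:
        emit_end = min(start + chunk, n_frames)
        context_end = min(emit_end + lookahead, n_frames)
        plan.append((start, emit_end, context_end))
        start = emit_end
    return plan


plan = plan_chunks(n_frames=100, chunk=40)
emitted = sum(emit_end - start for start, emit_end, _ in plan)
discarded = sum(context_end - emit_end for _, emit_end, context_end in plan)
```

Here every frame is emitted exactly once, and the discarded frames are the extra compute (and the extra wait before each chunk can be released) that buys smoother boundaries.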
When it falls down
Sentence-final versus sentence-medial intonation. A system that fires off chunks eagerly will generate the intonation appropriate for a mid-utterance phrase on what turns out to be the final phrase. In English, rising pitch on the last clause of a response sounds uncertain or questioning. The only robust fix is either a global prosody model that estimates the full pitch contour before chunking (which costs latency) or a post-hoc pitch correction pass (which requires knowing when the last chunk has been committed).
High RTF under load. A single-GPU server handling 50 concurrent streaming sessions can see per-session RTF spike above 1.0, causing buffer underruns. Batch-within-stream strategies help: group concurrent chunk requests into a single batched forward pass on the GPU. But waiting to fill a batch adds scheduling latency that competes with TTFA.
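The batching idea above can be sketched as a micro-batcher that groups chunk requests by arrival time. This is a simplified offline model, assuming sorted arrival timestamps; `window_ms` and `max_batch` are hypothetical tuning knobs, and a real server would do this with a queue and a timer.

```python
from typing import List


def micro_batches(arrivals_ms: List[float], window_ms: float, max_batch: int) -> List[List[int]]:
    """Group request indices into batches.

    A batch is closed when the next request arrives more than `window_ms`
    after the batch's first request, or when the batch already holds
    `max_batch` requests. `arrivals_ms` must be sorted ascending.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    window_start = 0.0
    for idx, t in enumerate(arrivals_ms):
        if current and (t - window_start > window_ms or len(current) == max_batch):
            batches.append(current)
            current = []
        if not current:
            window_start = t  # the first request of a batch opens the window
        current.append(idx)
    if current:
        batches.append(current)
    return batches


batches = micro_batches([0, 2, 4, 12, 13, 30], window_ms=5, max_batch=4)
```

The trade-off is visible in `window_ms`: a request that opens a batch may wait up to that long before the GPU starts on it, and that wait is added directly to its TTFA.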
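LLM-TTS token rate mismatch. The TTS engine cannot start until the LLM has produced a full synthesis unit. At 30 tokens/second, a 15-word minimum phrase (about 20 tokens, assuming roughly 1.3 tokens per English word) takes around 0.7 seconds to arrive, which alone exceeds the TTFA budget before any synthesis has run; slower LLMs or sentence-level triggers can push the wait to several seconds. The fix is to let the TTS begin on shorter units (even 5-6 words), accepting worse prosody on the opening phrase.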
Network jitter in cloud deployments. Streaming audio over HTTP/2 server-sent events or WebSockets is sensitive to TCP retransmission. A 150 ms jitter spike can empty the playback buffer even if RTF is nominally 0.3. Adaptive buffering or WebRTC-based delivery is needed for real-world reliability.
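One simple form of adaptive buffering sizes the playback buffer from the observed spread of chunk inter-arrival times. The heuristic below (a multiple of the standard deviation, clamped to a range) is an illustrative assumption rather than a standard algorithm; production jitter buffers are considerably more elaborate. The 80 ms floor echoes the lower bound quoted earlier, while `k` and `ceil_ms` are made-up defaults.

```python
import statistics
from typing import Sequence


def target_buffer_ms(
    arrival_gaps_ms: Sequence[float],
    k: float = 3.0,
    floor_ms: float = 80.0,
    ceil_ms: float = 400.0,
) -> float:
    """Suggest a playback buffer depth from recent chunk inter-arrival gaps."""
    if len(arrival_gaps_ms) < 2:
        return floor_ms  # not enough data to estimate jitter
    jitter_ms = statistics.pstdev(arrival_gaps_ms)
    # Deeper buffer when arrivals are irregular, never below the floor,
    # never above the ceiling (more buffer means more perceived latency).
    return min(ceil_ms, max(floor_ms, k * jitter_ms))


steady = target_buffer_ms([100.0] * 10)
spiky = target_buffer_ms([100.0] * 9 + [250.0])
```

With perfectly regular arrivals the suggestion stays at the floor; a single 150 ms late chunk in ten raises it, which is the behaviour needed to survive the jitter spike described above.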
Stateful continuation is fragile. Streaming codec-LM inference must maintain the KV-cache state across chunk boundaries. A session that is interrupted (network drop, client reconnect) must either restart synthesis from the beginning or checkpoint and restore the KV state. Restarting wastes compute and delays the resumed audio; checkpointing means serialising a per-session cache that grows with every generated token.
Further reading
- High Fidelity Neural Audio Compression (EnCodec) - Défossez et al. 2022; the streaming encoder-decoder codec underpinning VALL-E and many subsequent systems.
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E) - Wang et al. 2023; establishes codec-token language modelling as the dominant TTS paradigm.
- BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data - Łajszczak et al. 2024; describes a production-grade streamable convolution decoder for codec tokens.
- AudioLM: a Language Modeling Approach to Audio Generation - Borsos et al. 2023; hybrid tokenisation scheme that separates semantic and acoustic codes, relevant to latency-quality trade-offs in hierarchical generation.