Endpointing and Voice Activity Detection
Endpointing and voice activity detection are the mechanisms that decide when a user has finished speaking, directly controlling the latency and correctness of every streaming ASR system.
intermediate · 8 min read
A voice assistant that cuts you off mid-sentence is broken. One that waits two seconds after you finish speaking to start responding is annoying. Both failures trace back to the same subsystem: the component that answers the question "has the speaker stopped yet?" That component is the endpointer, and it is built on top of voice activity detection (VAD).
The two terms are related but distinct. VAD is a binary classifier that labels each audio frame as speech or non-speech. Endpointing is the higher-level policy that uses VAD output (and often richer signals) to decide that an utterance is complete and the transcription engine should finalise its output.
Voice Activity Detection: the frame-level problem
The input to a VAD model is typically a short window of audio (10-30 ms) represented as a feature vector. Common choices are log-mel filterbank energies or embeddings learned directly from the raw waveform. The output is a probability that the frame contains speech.
Traditional VAD relied on handcrafted features: short-time energy, zero-crossing rate, and spectral entropy. The core insight is that voiced speech typically has higher energy, a lower zero-crossing rate, and a more structured (lower-entropy) spectrum than stationary background noise. Energy-based detectors work well in clean conditions but collapse when noise energy approaches speech energy (SNR below roughly 5 dB).
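The handcrafted features above are simple enough to sketch in a few lines. The following is a minimal illustration, not a production detector: the function names, the 20 ms frame length, the 10 dB margin, and the noise-floor estimate (the quietest frame in the clip) are all assumptions made for the example, and a 200 Hz tone stands in for voiced speech.

```python
import math
import random

def frame_features(samples, frame_len):
    """Return (energy_db, zero_crossing_rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        energy_db = 10.0 * math.log10(energy + 1e-12)  # epsilon avoids log(0)
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy_db, crossings / (frame_len - 1)))
    return feats

def energy_vad(samples, frame_len=320, margin_db=10.0):
    """Label a frame as speech when its energy exceeds the noise floor by margin_db.

    Assumption: the quietest frame in the clip contains only noise.
    """
    feats = frame_features(samples, frame_len)
    noise_floor_db = min(e for e, _ in feats)
    return [e > noise_floor_db + margin_db for e, _ in feats]

def make_noise(rng, n, amplitude=0.01):
    """Low-level uniform noise standing in for a quiet background."""
    return [rng.uniform(-amplitude, amplitude) for _ in range(n)]

# Synthetic 16 kHz signal: 5 noise frames, 5 "voiced" frames (200 Hz tone), 5 noise frames.
SAMPLE_RATE = 16000
FRAME_LEN = 320  # 20 ms at 16 kHz
rng = random.Random(0)
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(5 * FRAME_LEN)]
signal = make_noise(rng, 5 * FRAME_LEN) + tone + make_noise(rng, 5 * FRAME_LEN)

features = frame_features(signal, FRAME_LEN)
labels = energy_vad(signal, FRAME_LEN)
```

On this clean synthetic signal the tone frames sit tens of decibels above the noise floor and have a far lower zero-crossing rate than the noise frames. Raising the noise amplitude toward the tone amplitude shrinks that gap, which is the failure mode described above.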
Modern VAD uses small neural networks trained on large, diverse corpora. Silero VAD, for instance, processes audio chunks of 30 ms or more and runs the full forward pass in under 1 ms on a single CPU thread. The model weighs approximately 2 MB and supports 8 kHz and 16 kHz audio. The key architectural choice is a recurrent or causal design: the model must not look at future frames if it is operating in streaming mode.
Formally, a VAD model computes:
p(speech | x_t, h_{t-1})
where x_t is the feature vector for frame t and h_{t-1} is the recurrent hidden state carrying context from previous frames. A threshold tau (typically 0.5, tunable) converts the probability to a binary label.
The threshold determines the trade-off between false alarm rate (non-speech labelled as speech) and miss rate (speech labelled as non-speech). Lowering tau labels more frames as speech, which reduces misses at the price of more false alarms. In voice assistant applications the cost of a miss is higher than the cost of a false alarm, so tau is often set below 0.5.
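A small worked example makes the trade-off concrete. The probabilities and ground-truth labels below are invented for illustration, and `error_rates` is a hypothetical helper, not part of any VAD library.

```python
def error_rates(probs, labels, tau):
    """Return (false_alarm_rate, miss_rate) when frames with p >= tau are called speech."""
    preds = [p >= tau for p in probs]
    on_non_speech = [pred for pred, is_speech in zip(preds, labels) if not is_speech]
    on_speech = [pred for pred, is_speech in zip(preds, labels) if is_speech]
    false_alarm_rate = sum(on_non_speech) / len(on_non_speech)
    miss_rate = sum(1 for pred in on_speech if not pred) / len(on_speech)
    return false_alarm_rate, miss_rate

# Invented per-frame speech probabilities and ground-truth labels (True = speech).
probs = [0.05, 0.20, 0.35, 0.45, 0.40, 0.55, 0.70, 0.90, 0.30, 0.10]
labels = [False, False, False, False, True, True, True, True, True, False]

at_half = error_rates(probs, labels, 0.5)   # (0.0, 0.4): no false alarms, 2 of 5 speech frames missed
at_low = error_rates(probs, labels, 0.3)    # (0.4, 0.0): 2 of 5 non-speech frames flagged, no misses
```

Moving tau from 0.5 to 0.3 on this toy data trades every miss for a false alarm, which is the direction a voice assistant usually prefers.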
Endpointing: from frame labels to utterance boundaries
VAD gives a sequence of per-frame binary labels. Endpointing converts that sequence into a single decision: "this utterance ended at time T."
The simplest endpointer applies a silence timeout: if VAD has been consistently non-speech for N consecutive frames, declare end-of-utterance (EOU). Typical values of N correspond to 300-800 ms of silence. This works, but the system always waits out the full timeout after the speaker's last word, so the timeout is added directly to response latency.
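A silence-timeout endpointer can be sketched as a single pass over the VAD labels. This is an illustrative implementation under assumed parameters (20 ms frames, a 500 ms default timeout); the function name and the choice to ignore silence before the first speech frame are assumptions of the sketch.

```python
def silence_timeout_endpoint(vad_labels, frame_ms=20, timeout_ms=500):
    """Return the index of the frame at which EOU is declared, or None.

    EOU fires once speech has been seen and is then followed by
    timeout_ms of consecutive non-speech frames.
    """
    frames_required = -(-timeout_ms // frame_ms)  # ceiling division
    seen_speech = False
    silent_run = 0
    for i, is_speech in enumerate(vad_labels):
        if is_speech:
            seen_speech = True
            silent_run = 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= frames_required:
                return i
    return None

# 100 ms leading silence, 400 ms speech, a 200 ms mid-utterance pause,
# 300 ms more speech, then trailing silence.
vad_labels = [0] * 5 + [1] * 20 + [0] * 10 + [1] * 15 + [0] * 40

eou_500 = silence_timeout_endpoint(vad_labels, timeout_ms=500)  # 74: fires 500 ms after the last speech frame (index 49)
eou_200 = silence_timeout_endpoint(vad_labels, timeout_ms=200)  # 34: fires inside the mid-utterance pause
```

The 500 ms timeout survives the pause but pays 500 ms of latency; the 200 ms timeout responds faster but cuts the utterance at the pause.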
More sophisticated endpointers incorporate acoustic, linguistic, and model-internal signals:
| Signal type | Example | Benefit |
|---|---|---|
| Acoustic silence | VAD label = 0 for 500 ms | Simple, language-agnostic |
| Prosodic features | Falling pitch, reduced energy | Detects sentence-final intonation |
| Language model score | High posterior on EOS token | Captures semantic completeness |
| ASR model-internal | Transducer blank token rate | Tight coupling with decoder state |
The ASR-model-internal approach has become dominant in end-to-end streaming systems. In a recurrent neural network transducer (RNN-T), the model emits blank tokens when no new label should be output. A sustained run of blank tokens following a substantive emission is a strong signal that the speaker has finished. Transducers tend to emit labels late, however, which delays this signal. Anandh et al. (2025) counter the delay by adding an explicit end-of-word token and a delay penalty to RNN-T training, and use an auxiliary network to obtain reliable frame-level speech activity detection for conversational speech.
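The blank-run heuristic can be sketched as follows. This is a simplification rather than any particular system's decoder: it assumes one decoding decision per encoder frame (a real transducer may emit several labels on one frame), and the `BLANK` marker, function name, and `max_blank_frames` value are hypothetical.

```python
BLANK = "<blank>"  # hypothetical marker for the transducer's blank symbol

def blank_run_endpoint(frame_tokens, max_blank_frames=12):
    """Return the frame index at which EOU is declared, or None.

    EOU fires when max_blank_frames consecutive blanks follow at least
    one non-blank emission. Blanks before the first label are ignored.
    """
    emitted_any = False
    blank_run = 0
    for i, token in enumerate(frame_tokens):
        if token != BLANK:
            emitted_any = True
            blank_run = 0
        elif emitted_any:
            blank_run += 1
            if blank_run >= max_blank_frames:
                return i
    return None

# One token per encoder frame: labels at frames 3, 6 and 11, blanks elsewhere.
frame_tokens = (
    [BLANK] * 3 + ["turn"] + [BLANK] * 2 + ["on"] + [BLANK] * 4 + ["lights"] + [BLANK] * 15
)

eou_frame = blank_run_endpoint(frame_tokens, max_blank_frames=12)   # 23: twelfth blank after "lights"
too_eager = blank_run_endpoint(frame_tokens, max_blank_frames=4)    # 10: fires in the gap before "lights"
```

As with a silence timeout, the run length is a latency knob: a short run fires in the gap between words, a long run waits longer after the final word.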
Semantic endpointing: letting the LM decide
Acoustic silence is not always the right cue. A pause mid-sentence (common in spontaneous speech, especially after filler words like "um") can trigger a premature endpoint. Conversely, a complete question that ends on rising intonation can look unfinished to a detector that expects sentence-final falling pitch, even though genuine silence follows. In both cases the right decision depends on the semantic content of what was said, not on the acoustics alone.
Semantic VAD (Shi et al., 2023) addresses this by adding frame-level punctuation prediction as an auxiliary task alongside the binary speech/non-speech classifier. A predicted sentence-final punctuation mark signals semantic completeness, allowing the endpointer to fire even before the acoustic silence timer would expire. The paper reports a 53.3% reduction in average latency compared to the acoustic-only baseline, with no significant increase in character error rate.
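One way to picture how a punctuation signal can shorten latency is a timeout that depends on the most recent punctuation prediction. The policy below is an illustrative assumption, not the method of Shi et al.: the two timeout values, the frame format `(is_speech, punct)`, and the function name are all hypothetical.

```python
SENTENCE_FINAL = {".", "?", "!"}

def semantic_endpoint(frames, frame_ms=20, long_timeout_ms=700, short_timeout_ms=200):
    """Return the frame index at which EOU is declared, or None.

    frames: sequence of (is_speech, punct) pairs, where punct is the
    punctuation mark predicted at that frame or None.
    The silence timeout is shortened once sentence-final punctuation
    has been predicted since the last unpunctuated speech frame.
    """
    seen_speech = False
    silent_run = 0
    last_punct = None
    for i, (is_speech, punct) in enumerate(frames):
        if is_speech:
            seen_speech = True
            silent_run = 0
            last_punct = punct  # new speech invalidates older punctuation
            continue
        if punct is not None:
            last_punct = punct
        if not seen_speech:
            continue
        silent_run += 1
        timeout_ms = short_timeout_ms if last_punct in SENTENCE_FINAL else long_timeout_ms
        if silent_run * frame_ms >= timeout_ms:
            return i
    return None

# 200 ms of speech followed by silence; "?" is predicted on the first silent frame.
with_punct = [(True, None)] * 10 + [(False, "?")] + [(False, None)] * 40
without_punct = [(True, None)] * 10 + [(False, None)] * 41

fast = semantic_endpoint(with_punct)      # 19: short timeout applies
slow = semantic_endpoint(without_punct)   # 44: falls back to the long timeout
```

In this toy case the punctuation cue lets the endpointer fire 25 frames (500 ms) earlier than the acoustic-only fallback, while a mid-sentence pause with no sentence-final prediction still gets the long timeout.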
A unified model such as TokenVerse (Kumar et al., 2024) goes further: a single transducer jointly performs ASR transcription, speaker change detection, named entity recognition, and endpointing by emitting task-specific tokens. This removes the cascade entirely. The authors report ASR improvements of up to 7.7% in relative word error rate, and the unified model outperforms a cascaded pipeline on the individual tasks.
On-device endpointing
Streaming ASR on mobile devices adds the constraint that everything must run in real time with limited compute. The approach described by Li et al. (2022) introduces an Encoder Endpointer model and an End-of-Utterance (EOU) Joint Layer that share encoder representations with the main ASR model. The joint layer produces a scalar EOU probability at each encoder frame; endpointing fires when that probability crosses a threshold. The entire system, including multilingual ASR and endpointing, runs faster than real time on a mobile CPU, meaning each second of audio takes less than one second to process.
The shared-encoder design is important: a separate endpointer would add latency and compute. By branching off a lightweight head from the ASR encoder, the system gets endpointing almost for free.
When it falls down
Hesitations and disfluencies. Speakers often pause mid-utterance, especially in conversational or read-aloud contexts. A pure silence-based endpointer fires on these pauses, chopping the utterance in half. Longer silence timeouts reduce this but increase latency for clean utterances.
Low-SNR environments. When background noise is loud relative to speech, VAD models trained on clean or mildly noisy data miss speech during voiced segments and falsely detect speech during noise bursts. Augmenting training data with diverse noise types helps, but there is no reliable fix when SNR drops below 0 dB.
Music and non-speech vocalisations. Singing, laughter, and crying are acoustically closer to speech than to silence. Most VAD models trained on speech versus ambient noise will label these as speech, which either opens a spurious utterance or keeps the endpointer from firing after the speaker has actually finished.
Domain mismatch in semantic endpointing. Punctuation prediction models are trained on written text; spontaneous speech punctuation is highly uncertain. When the ASR transcript contains domain-specific jargon not seen during LM training, semantic endpointing degrades silently.
The latency-accuracy trade-off cannot be eliminated. Every endpointing system faces a fundamental tension: firing early reduces latency but risks truncating the utterance; firing late preserves completeness but degrades user experience. The optimal threshold shifts across speakers, languages, noise conditions, and application domains. There is no universal setting.
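The tension can be made concrete by sweeping a silence timeout over a set of utterances. The pause durations below are invented for illustration, and the model is deliberately crude: an utterance counts as truncated whenever its longest internal pause reaches the timeout, and the added latency is simply the timeout itself.

```python
def truncation_rate(longest_pauses_ms, timeout_ms):
    """Fraction of utterances whose longest internal pause reaches the timeout."""
    truncated = sum(1 for pause in longest_pauses_ms if pause >= timeout_ms)
    return truncated / len(longest_pauses_ms)

# Invented data: the longest mid-utterance pause (ms) in each of ten utterances.
longest_pauses_ms = [120, 180, 250, 310, 340, 420, 480, 560, 650, 900]

# Map each candidate timeout (which is also the added latency) to its truncation rate.
sweep = {t: truncation_rate(longest_pauses_ms, t) for t in (300, 500, 800)}
# {300: 0.7, 500: 0.3, 800: 0.1}
```

On this toy data no timeout reaches both low latency and low truncation: 300 ms cuts off 70% of utterances, while 800 ms still truncates 10% and adds 800 ms to every response. A different speaker population would shift the whole curve.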
Multi-talker and overlapping speech. Endpoint detection in overlapping speech (two people speaking simultaneously) is an open problem. An EOU token appended to a transducer trained on single-speaker data does not generalise cleanly to the multi-talker case without architectural changes (Lu et al., 2022).
Further reading
- Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction (Shi et al., 2023)
- Improving endpoint detection in end-to-end streaming ASR for conversational speech (Anandh et al., 2025)
- A Language Agnostic Multilingual Streaming On-Device ASR System (Li et al., 2022)
- Silero VAD: pre-trained enterprise-grade voice activity detector