Speaker Diarisation
Speaker diarisation segments an audio recording into speaker-homogeneous regions and assigns each region a speaker identity, answering the question "who spoke when" without necessarily transcribing what was said.
A one-hour board meeting recording arrives as a single mono WAV file. Your ASR engine transcribes it faithfully, but the transcript is a wall of text with no indication of who said what. That is the precise problem diarisation solves: partition the audio timeline into contiguous segments and label each segment with a speaker identifier. The labels are arbitrary ("SPEAKER_0", "SPEAKER_1") unless an external reference is available; the system has no concept of names, only of acoustically distinct identities.
The classical pipeline
Most production diarisation systems share the same four-stage skeleton:
| Stage | Input | Output |
|---|---|---|
| Voice activity detection (VAD) | Raw waveform | Non-speech frames removed |
| Segmentation | VAD-filtered audio | Short, speaker-homogeneous windows |
| Embedding extraction | Each segment | Fixed-length speaker vector |
| Clustering | Set of speaker vectors | Cluster labels per segment |
Voice activity detection strips silence, music, and other non-speech. Getting this wrong poisons every downstream step: a pause that goes undetected between two turns can fuse them into one segment, so one speaker's frames bleed into the next speaker's embedding.
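To make the VAD stage concrete, here is a minimal energy-threshold sketch. It is only an illustration: production systems generally use trained VAD models, and the function name `energy_vad`, the frame sizes, and the -40 dB threshold are assumptions chosen for this example rather than values from any particular toolkit.

```python
import math

def energy_vad(samples, sample_rate, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Return one boolean per frame: True where frame energy exceeds the threshold.

    Hypothetical helper. Assumes samples are floats in [-1, 1] and uses a fixed
    threshold in dB relative to full scale (an assumption, not a tuned value).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        mean_power = sum(x * x for x in frame) / frame_len
        # Small constant avoids log(0) on digital silence.
        level_db = 10.0 * math.log10(mean_power + 1e-12)
        decisions.append(level_db > threshold_db)
    return decisions

# Half a second of silence followed by half a second of a 220 Hz tone.
sr = 8000
silence = [0.0] * (sr // 2)
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr // 2)]
flags = energy_vad(silence + tone, sr)
```

Frames at the start of `flags` are marked as non-speech and frames at the end as speech. A fixed threshold like this breaks down quickly with background noise, which is one reason the VAD stage is usually a learned model.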
Segmentation cuts the VAD-passed audio into windows of roughly 1-3 seconds, ideally at silence boundaries. Historically, BIC (Bayesian Information Criterion) change-point detection found speaker-change boundaries by asking whether a single Gaussian or two Gaussians better fit two adjacent windows.
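The BIC test can be sketched in a few lines. The version below is a simplification that assumes one-dimensional features, so each Gaussian has just a mean and a variance; real systems apply the same idea to multi-dimensional cepstral features with full covariance matrices. The names `delta_bic` and `penalty_weight` are illustrative.

```python
import math

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def delta_bic(left, right, penalty_weight=1.0):
    """Positive value: two Gaussians fit better, so a change point is likely.

    Hypothetical helper for 1-D features only (an assumption of this sketch).
    """
    n1, n2 = len(left), len(right)
    n = n1 + n2
    # Log-likelihood gain from modelling the windows separately.
    gain = 0.5 * (n * math.log(variance(left + right))
                  - n1 * math.log(variance(left))
                  - n2 * math.log(variance(right)))
    # Splitting adds one extra Gaussian: 2 parameters (mean, variance) in 1-D.
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty

# Deterministic toy features: a and c share a distribution, b is shifted.
a = [((i * 37) % 101) / 101.0 for i in range(200)]
b = [5.0 + ((i * 37) % 101) / 101.0 for i in range(200)]
c = [((i * 53 + 7) % 101) / 101.0 for i in range(200)]

change_score = delta_bic(a, b)     # windows differ: expected to be positive
no_change_score = delta_bic(a, c)  # windows match: expected to be negative
```

The `penalty_weight` term plays the role of the tunable penalty factor usually written as lambda; raising it makes the detector more conservative about declaring a speaker change.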
Embedding extraction converts each segment into a compact vector that captures speaker identity while discarding content. x-vectors (embeddings taken from a time-delay neural network trained with cross-entropy on speaker labels) were the dominant approach for several years. They represent a speaker's vocal tract characteristics, prosody, and speaking style as a point in a 512-dimensional space. More recently, ResNet-based embeddings have superseded x-vectors on most benchmarks.
Clustering groups segments whose embeddings are close. Agglomerative hierarchical clustering (AHC) with cosine distance is the most widely deployed variant. VBx (Variational Bayes over x-vector sequences modelled by a Bayesian HMM) is a principled alternative that infers both the number of speakers and the segment assignments jointly, avoiding the separate "decide number of clusters" step.
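As a rough illustration of AHC with cosine distance, the sketch below uses average linkage and stops merging once the closest pair of clusters is further apart than a threshold. It is a naive pure-Python version for readability; the function name `ahc`, the choice of average linkage, and the threshold value are assumptions of this example, and a real system would use an optimised library implementation.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def ahc(embeddings, threshold):
    """Average-linkage AHC; returns one cluster label per embedding (hypothetical helper)."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Stopping criterion: the threshold decides how many speakers remain.
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy 2-D "embeddings": segments 0, 1, 4 point one way, segments 2, 3 another.
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = ahc(segments, threshold=0.3)
```

With these toy vectors the first, second, and fifth segments end up sharing one label and the third and fourth share another. Setting the threshold very low leaves every segment in its own cluster, and setting it very high merges everything into one speaker.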
Speaker embeddings in depth
The embedding extractor is the system's beating heart. A typical x-vector extractor:
- Computes 23-dim Mel filterbank features per 25 ms frame.
- Passes frames through several TDNN layers, each of which sees a short temporal context around the current frame (e.g. [-2, 2] frames).
- Applies a statistics pooling layer that concatenates the frame-level mean and standard deviation over the whole segment to produce a fixed-length summary.
- Projects through two fully connected layers; the first FC output is the x-vector.
Training uses speaker classification on thousands of speakers (VoxCeleb, CN-Celeb). The network never sees the target recording; it learns a general-purpose speaker representation that transfers to unseen conditions.
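The pooling step in NumPy, with random values standing in for the TDNN output:
import numpy as np
T, D = 300, 1500 # illustrative sizes: T frames, D-dim TDNN output
frame_outputs = np.random.randn(T, D) # [T, D] stand-in for real frame-level outputs
mean = frame_outputs.mean(axis=0) # [D]
std = frame_outputs.std(axis=0) # [D]
segment_vector = np.concatenate([mean, std]) # [2D] -> into FC layers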
After extraction, embeddings are length-normalised (L2) and optionally scored with PLDA (Probabilistic Linear Discriminant Analysis) instead of plain cosine distance, which sharpens the within-speaker vs. between-speaker discrimination.
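Length normalisation and cosine scoring are simple enough to show directly. The sketch below uses plain Python lists as stand-ins for embeddings; `length_normalise` and `cosine_score` are illustrative names, and PLDA is deliberately left out.

```python
import math

def length_normalise(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_score(u, v):
    """Cosine similarity: the dot product of the two length-normalised vectors."""
    u, v = length_normalise(u), length_normalise(v)
    return sum(a * b for a, b in zip(u, v))

# Toy 3-D vectors standing in for real embeddings.
seg_a = [3.0, 4.0, 0.0]
seg_b = [6.0, 8.0, 0.0]   # same direction as seg_a, different magnitude
seg_c = [0.0, 0.0, 2.0]   # orthogonal to seg_a

same_direction = cosine_score(seg_a, seg_b)
orthogonal = cosine_score(seg_a, seg_c)
```

Because both vectors are scaled to unit length first, `seg_a` and `seg_b` score as essentially identical despite their different magnitudes, while `seg_c` scores zero against either of them.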
End-to-end and neural alternatives
The modular pipeline has an obvious weakness: errors compound across stages and none of the components optimise the final diarisation objective directly. End-to-end (E2E) approaches attempt to fix this.
EEND (End-to-End Neural Diarisation) reformulates diarisation as frame-wise multi-label classification. The model ingests a sequence of acoustic features and outputs, for each frame, an independent activity probability for each speaker. It is trained directly on diarisation labels with a permutation-invariant training (PIT) loss that matches predicted speaker channels to reference channels in the lowest-error assignment.
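The permutation-invariant loss can be sketched as a brute-force search over channel orderings. This example assumes the model emits one activity probability per speaker per frame and uses binary cross-entropy as the per-channel loss; the function names are illustrative, and trying every permutation is only practical for a handful of speakers.

```python
import math
from itertools import permutations

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p against a 0/1 target y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def pit_loss(pred, ref):
    """pred, ref: [T][S] nested lists (frames x speakers).

    Returns (loss, perm) where perm[s] is the predicted channel matched to
    reference speaker s. Hypothetical helper; brute force over S! orderings.
    """
    num_frames = len(ref)
    num_speakers = len(ref[0])
    best_loss, best_perm = None, None
    for perm in permutations(range(num_speakers)):
        total = sum(bce(pred[t][perm[s]], ref[t][s])
                    for t in range(num_frames) for s in range(num_speakers))
        mean = total / (num_frames * num_speakers)
        if best_loss is None or mean < best_loss:
            best_loss, best_perm = mean, perm
    return best_loss, best_perm

# Three frames, two speakers; the middle frame is overlapped speech.
ref = [[1, 0], [1, 1], [0, 1]]
# Predictions are good but the output channels are swapped relative to ref.
pred = [[0.1, 0.9], [0.8, 0.9], [0.9, 0.2]]
loss, perm = pit_loss(pred, ref)
```

Here the search picks the swapped assignment, so the model is not penalised for emitting the right speakers on the "wrong" output channels. That is the point of PIT: the speaker labels are arbitrary, so only the best matching should count.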
EEND handles overlapping speech naturally (two output channels can be active simultaneously), but it fixes the maximum number of speakers at training time. Attractor-based extensions (EEND-EDA) address this by generating a variable number of speaker attractors from the input sequence.
Clustering-based systems with neural overlap detection offer a practical compromise: run the classical pipeline, then layer on a separate binary classifier that detects overlapping frames and assigns them to the top-2 nearest cluster centroids.
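The top-2 assignment step might look like the following sketch. It assumes the overlap decision arrives as a boolean from a separate detector, which is not modelled here, and that cluster centroids are already available; `assign_speakers` is a hypothetical name.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_speakers(frame_embedding, centroids, is_overlap):
    """Return the index of the nearest centroid, or the two nearest when the
    (external, assumed) overlap detector has flagged the frame."""
    ranked = sorted(range(len(centroids)),
                    key=lambda k: cosine_similarity(frame_embedding, centroids[k]),
                    reverse=True)
    return ranked[:2] if is_overlap else ranked[:1]

# Three toy cluster centroids and one frame that sits between the first two.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mixed = [0.7, 0.6, 0.1]

single = assign_speakers(mixed, centroids, is_overlap=False)
double = assign_speakers(mixed, centroids, is_overlap=True)
```

Without the overlap flag the frame goes to the first speaker only; with it, the first two speakers are both marked active. The quality of the result depends entirely on the overlap detector, since a false positive adds a second speaker who was not talking.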
When it falls down
Overlapping speech. Two people talking at once is one of the most common failure modes. Classical clustering assigns each frame to exactly one speaker; a frame with two simultaneous voices will be attributed to whichever speaker it is acoustically closer to, silently dropping the other.
Short turns. Segment embeddings computed from less than 0.5 seconds of speech are unreliable. Call-centre conversations with rapid back-channels ("uh-huh", "yeah") are frequently mis-attributed or collapsed into the previous speaker.
Identical-sounding speakers. Twins, same-sex pairs with similar vocal tracts, or a single speaker imitating another will confuse any system that relies purely on acoustic features. Without linguistic or temporal context, the acoustic difference may be too small to separate them.
Domain mismatch. An extractor trained on telephone speech (8 kHz, narrow-band) degrades sharply on far-field meeting microphones (16 kHz, reverberation, multiple simultaneous noise sources). Fine-tuning on in-domain data is almost always necessary.
Number-of-speakers estimation. AHC requires a stopping criterion: an absolute number, an elbow heuristic on linkage distances, or a tuned threshold. In practice, this threshold is the single most sensitive hyperparameter. VBx sidesteps this but requires its own hyperparameter (speaker regularisation weight).
Long recordings with speaker drift. A speaker's vocal qualities change subtly over hours (fatigue, emotion, microphone position changes). Embeddings computed early and late in a session for the same speaker may fall into different clusters; this appears as over-segmentation.
Evaluation metric subtleties. The standard metric, DER (Diarisation Error Rate), is the summed duration of false alarm, missed speech, and speaker confusion divided by total reference speech time. DER ignores transcription errors and weights every error by its duration, so a speaker with only a few short turns contributes little even when entirely mis-attributed; this can make a system look good on a recording dominated by a single long-turn speaker.
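A frame-level approximation of DER shows how a dominant speaker can flatter the score. The sketch below assumes reference and hypothesis are given as one set of active speaker labels per frame, and it finds the label mapping by brute force; real scoring tools work on timestamps, use an optimal assignment algorithm, and often apply a forgiveness collar, none of which is modelled here. `frame_der` is a hypothetical name.

```python
from itertools import permutations

def frame_der(ref, hyp):
    """ref, hyp: equal-length lists of sets of active speaker labels per frame.

    Returns a dict of rates relative to total reference speech (hypothetical helper).
    """
    ref_labels = sorted(set().union(*ref))
    hyp_labels = sorted(set().union(*hyp))
    # Pad with None so surplus hypothesis speakers can map to "nobody".
    padded = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))
    total_ref = sum(len(r) for r in ref)
    miss = sum(max(0, len(r) - len(h)) for r, h in zip(ref, hyp))
    false_alarm = sum(max(0, len(h) - len(r)) for r, h in zip(ref, hyp))

    # Speaker labels are arbitrary, so score under the most favourable mapping.
    best_confusion = None
    for perm in permutations(padded, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        confusion = 0
        for r, h in zip(ref, hyp):
            correct = len(r & {mapping[x] for x in h})
            confusion += min(len(r), len(h)) - correct
        if best_confusion is None or confusion < best_confusion:
            best_confusion = confusion

    return {
        "miss": miss / total_ref,
        "false_alarm": false_alarm / total_ref,
        "confusion": best_confusion / total_ref,
        "der": (miss + false_alarm + best_confusion) / total_ref,
    }

# Speaker A holds the floor for 90 frames, speaker B for 10.
ref = [{"A"}] * 90 + [{"B"}] * 10
# A degenerate system that labels the entire recording as one speaker.
hyp = [{"X"}] * 100
result = frame_der(ref, hyp)
```

In this toy case the system never finds speaker B at all, yet the DER is only 10%, because B accounts for a tenth of the reference speech. Reporting per-speaker error alongside DER is one way to expose this kind of failure.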
Further reading
- Fully Supervised Speaker Diarization (UIS-RNN) - Zhang et al., 2018; an early fully supervised diarisation model that replaces unsupervised clustering with an RNN over d-vectors.
- Bayesian HMM clustering of x-vector sequences (VBx) - Landini et al., 2020; a principled clustering back-end with strong benchmark results.
- DiariZen Explained: A Tutorial for the Open Source State-of-the-Art Speaker Diarization Pipeline - Raghav, 2026; a practical seven-stage walkthrough with code references and visual explanations.
- SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription - Dai et al., 2026; illustrates joint diarisation and ASR in an LLM framework, with analysis of overlapping speech and rapid turn-taking.