Speaker Diarisation

Speaker diarisation segments an audio recording into speaker-homogeneous regions and assigns each region a speaker identity, answering the question "who spoke when" without necessarily transcribing what was said.

intermediate · 8 min read

A one-hour board meeting recording arrives as a single mono WAV file. Your ASR engine transcribes it faithfully, but the transcript is a wall of text with no indication of who said what. That is the precise problem diarisation solves: partition the audio timeline into contiguous segments and label each segment with a speaker identifier. The labels are arbitrary ("SPEAKER_0", "SPEAKER_1") unless an external reference is available; the system has no concept of names, only of acoustically distinct identities.

The classical pipeline

Most production diarisation systems share the same four-stage skeleton:

| Stage | Input | Output |
|---|---|---|
| Voice activity detection (VAD) | Raw waveform | Non-speech frames removed |
| Segmentation | VAD-filtered audio | Short, speaker-homogeneous windows |
| Embedding extraction | Each segment | Fixed-length speaker vector |
| Clustering | Set of speaker vectors | Cluster labels per segment |

Voice activity detection strips silence and music. Getting this wrong poisons every downstream step: a missed silence interval bleeds one speaker's frames into the next speaker's segment.
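As a rough illustration of what a VAD decision looks like, here is a minimal energy-threshold sketch. It is not how production VADs work (those are usually statistical or neural models); the function name `energy_vad`, the 25 ms frame length, and the -40 dB threshold are assumptions chosen for the example.

```python
import math

def energy_vad(samples, sample_rate, frame_ms=25, threshold_db=-40.0):
    """Return one boolean per frame: True where frame energy exceeds the threshold.

    samples: floats in [-1.0, 1.0]. threshold_db is relative to full scale
    and is an illustrative value, not a recommended setting.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        level_db = 10.0 * math.log10(power + 1e-12)  # epsilon avoids log(0)
        decisions.append(level_db > threshold_db)
    return decisions

# Half a second of silence followed by half a second of a 220 Hz tone.
sr = 8000
silence = [0.0] * (sr // 2)
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr // 2)]
flags = energy_vad(silence + tone, sr)
```

With this synthetic input the first half of `flags` is False and the second half is True. A pure energy threshold would also pass music and loud noise, which is one reason real systems use trained models instead.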

Segmentation cuts the VAD-passed audio into windows of roughly 1-3 seconds, ideally at silence boundaries. Historically, BIC (Bayesian Information Criterion) change-point detection found speaker-change boundaries by asking whether a single Gaussian or two Gaussians better fit two adjacent windows.
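The BIC test can be sketched for one-dimensional features as follows. Real systems apply it to multivariate features (typically MFCCs) with full covariance matrices; the 1-D simplification, the function names, and the `penalty_weight` parameter are assumptions made to keep the example short.

```python
import math

def gaussian_log_var(xs):
    """Log of the maximum-likelihood variance of a 1-D sample (floored for stability)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return math.log(max(var, 1e-10))

def delta_bic(left, right, penalty_weight=1.0):
    """Positive values favour two Gaussians, i.e. a speaker change between the windows."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    # Log-likelihood gain from modelling the two windows separately.
    gain = 0.5 * (n * gaussian_log_var(left + right)
                  - n1 * gaussian_log_var(left)
                  - n2 * gaussian_log_var(right))
    # A 1-D Gaussian has 2 free parameters (mean, variance), so splitting adds 2.
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty

# Two windows drawn from the same values, then two windows with shifted values.
same = delta_bic([0.0, 1.0] * 50, [0.0, 1.0] * 50)
change = delta_bic([0.0, 1.0] * 50, [10.0, 11.0] * 50)
```

Here `same` comes out negative (the penalty outweighs a zero likelihood gain) and `change` comes out positive, so only the second pair would be flagged as a change point.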

Embedding extraction converts each segment into a compact vector that captures speaker identity while discarding content. x-vectors (time-delay neural networks trained with cross-entropy on speaker labels) were the dominant approach for several years. They represent a speaker's vocal tract characteristics, prosody, and speaking style as a point in a 512-dimensional space. More recently, ResNet-based embeddings have superseded x-vectors on most benchmarks.

Clustering groups segments whose embeddings are close. Agglomerative hierarchical clustering (AHC) with cosine distance is the most widely deployed variant. VBx (Variational Bayes over x-vector sequences modelled by a Bayesian HMM) is a principled alternative that infers both the number of speakers and the segment assignments jointly, avoiding the separate "decide number of clusters" step.
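A minimal sketch of AHC with cosine distance and a threshold stopping rule is shown below. Average linkage is an assumption (the text does not specify a linkage), as are the function names and the toy 2-D embeddings; the quadratic pair search is written for clarity, not speed.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def ahc(embeddings, threshold):
    """Average-linkage AHC; merging stops once the closest pair exceeds `threshold`.

    Returns one cluster label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        pairs = [(linkage(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        best, a, b = min(pairs)
        if best > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy embeddings: segments 0, 1, 4 point one way; segments 2, 3 point another.
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = ahc(segments, threshold=0.3)
```

On this toy data a threshold of 0.3 yields two clusters. Raising it to 0.9 merges everything into one speaker and lowering it to 0.001 leaves five, which shows how directly the threshold controls the estimated speaker count.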

Speaker embeddings in depth

The embedding extractor is the system's beating heart. A typical x-vector extractor:

  1. Computes 23-dim Mel filterbank features per 25 ms frame.
  2. Passes frames through several TDNN layers, each of which sees a small window of temporal context around the current frame (e.g. [-2, 2] frames).
  3. Applies a statistics pooling layer that concatenates the frame-level mean and standard deviation over the whole segment to produce a fixed-length summary.
  4. Projects through two fully connected layers; the first FC output is the x-vector.

Training uses speaker classification on thousands of speakers (VoxCeleb, CN-Celeb). The network never sees the target recording; it learns a general-purpose speaker representation that transfers to unseen conditions.

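The pooling step, sketched in NumPy (random values stand in for real TDNN outputs):

import numpy as np

frame_outputs = np.random.randn(200, 512)     # [T, D]: T frames, D-dim TDNN output
mean = frame_outputs.mean(axis=0)             # [D]
std = frame_outputs.std(axis=0)               # [D]
segment_vector = np.concatenate([mean, std])  # [2D] -> into FC layers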

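After extraction, embeddings are length-normalised (L2). Pairs of embeddings are then compared either with plain cosine similarity or with PLDA (Probabilistic Linear Discriminant Analysis), a scoring model that represents within-speaker and between-speaker variability explicitly and so sharpens the discrimination between the two.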

End-to-end and neural alternatives

The modular pipeline has an obvious weakness: errors compound across stages and none of the components optimise the final diarisation objective directly. End-to-end (E2E) approaches attempt to fix this.

EEND (End-to-End Neural Diarisation) reformulates diarisation as a frame-wise multi-label classification problem. The model ingests a sequence of acoustic features and outputs, for each frame, an independent activity probability for each speaker channel. It is trained directly on diarisation labels with a permutation-invariant training (PIT) loss, which evaluates every assignment of predicted speaker channels to reference channels and uses the lowest-error one.
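The permutation-invariant loss can be sketched in a few lines. This is an illustrative, framework-free version using per-frame binary cross-entropy; the names `bce` and `pit_loss` and the toy predictions are assumptions, and a real implementation would run on tensors inside a training framework.

```python
import math
from itertools import permutations

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability/label pair."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def pit_loss(pred, ref):
    """pred, ref: [T][S] lists (frames x speakers). Returns (loss, best_permutation).

    best_permutation[s] is the predicted channel matched to reference speaker s.
    """
    n_frames, n_spk = len(ref), len(ref[0])
    best = None
    for perm in permutations(range(n_spk)):
        total = sum(bce(pred[t][perm[s]], ref[t][s])
                    for t in range(n_frames) for s in range(n_spk))
        mean = total / (n_frames * n_spk)
        if best is None or mean < best[0]:
            best = (mean, perm)
    return best

# Reference: speaker 0 active in frames 0-1, speaker 1 active in frames 1-2.
ref = [[1, 0], [1, 1], [0, 1]]
# The model got the activity right but emitted the speakers in swapped channels.
pred = [[0.1, 0.9], [0.8, 0.9], [0.9, 0.2]]
loss, perm = pit_loss(pred, ref)
```

The best permutation here is `(1, 0)`: the loss does not punish the model for the arbitrary channel order. The search is factorial in the number of speakers, which is tolerable only because that number is small.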

EEND handles overlapping speech naturally (two output channels can be active simultaneously), but it fixes the maximum number of speakers at training time. Attractor-based extensions (EEND-EDA) address this by generating a variable number of speaker attractors from the input sequence.

Clustering-based systems with neural overlap detection offer a practical compromise: run the classical pipeline, then layer on a separate binary classifier that detects overlapping frames and assigns them to the top-2 nearest cluster centroids.
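The overlap-aware assignment step might look like the sketch below. It assumes the overlap detector has already produced a boolean flag for the frame; `assign_speakers`, the centroid dictionary, and the 3-D vectors are hypothetical names and values used only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def assign_speakers(embedding, centroids, is_overlap):
    """Return the nearest centroid's label, or the two nearest when overlap is flagged."""
    ranked = sorted(centroids,
                    key=lambda label: cosine_similarity(embedding, centroids[label]),
                    reverse=True)
    return ranked[:2] if is_overlap else ranked[:1]

centroids = {
    "SPEAKER_0": [1.0, 0.0, 0.0],
    "SPEAKER_1": [0.0, 1.0, 0.0],
    "SPEAKER_2": [0.0, 0.0, 1.0],
}
frame = [0.7, 0.6, 0.1]
single = assign_speakers(frame, centroids, is_overlap=False)
double = assign_speakers(frame, centroids, is_overlap=True)
```

For this frame `single` is `["SPEAKER_0"]` and `double` is `["SPEAKER_0", "SPEAKER_1"]`. The quality of the result depends entirely on the overlap detector: a false overlap flag adds a spurious second speaker.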

When it falls down

Overlapping speech. Two people talking at once is the most common failure mode. Classical clustering assigns each frame to exactly one speaker; a frame with two simultaneous voices will be attributed to whichever speaker it is acoustically closer to, silently dropping the other.

Short turns. Segment embeddings computed from less than 0.5 seconds of speech are unreliable. Call-centre conversations with rapid back-channels ("uh-huh", "yeah") are frequently mis-attributed or collapsed into the previous speaker.

Identical-sounding speakers. Twins, same-sex pairs with similar vocal tracts, or a single speaker imitating another will confuse any system that relies purely on acoustic features. Their embeddings land close together, so without linguistic or temporal context there is little left to separate them.

Domain mismatch. An extractor trained on telephone speech (8 kHz, narrow-band) degrades sharply on far-field meeting microphones (16 kHz, reverberation, multiple simultaneous noise sources). Fine-tuning on in-domain data is almost always necessary.

Number-of-speakers estimation. AHC requires a stopping criterion: a known speaker count, an elbow heuristic on linkage distances, or a tuned distance threshold. In practice, this threshold is often the most sensitive hyperparameter in the pipeline. VBx sidesteps the explicit stopping criterion but introduces hyperparameters of its own (such as the speaker regularisation weight).

Long recordings with speaker drift. A speaker's vocal qualities change subtly over hours (fatigue, emotion, microphone position changes). Embeddings computed early and late in a session for the same speaker may fall into different clusters; this appears as over-segmentation.

Evaluation metric subtleties. The standard metric, DER (Diarisation Error Rate), is the sum of false alarm, missed speech, and speaker confusion durations divided by total reference speech time. DER ignores transcription errors, and it is weighted purely by duration: every second of error counts the same regardless of which speaker it belongs to. A system can therefore look good on a recording dominated by a single long-turn speaker while mis-attributing most of what the minority speakers said.
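The DER formula is simple enough to check by hand. The sketch below computes it for a hypothetical 600-second recording in which one speaker holds the floor for 540 seconds and a second speaker talks for 60 seconds; the function name and the numbers are made up for illustration.

```python
def der(false_alarm, missed, confusion, total_reference_speech):
    """All arguments are durations in seconds; the result is a fraction."""
    return (false_alarm + missed + confusion) / total_reference_speech

# The system labels the entire recording as the dominant speaker:
# no false alarms, no missed speech, but all 60 s of the second speaker are confused.
dominant_speaker_der = der(false_alarm=0.0, missed=0.0,
                           confusion=60.0, total_reference_speech=600.0)
```

The result is a DER of 10%, even though the second speaker was never identified at all. Reporting per-speaker error alongside DER is one way to expose this kind of failure.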
