Zero-Shot Voice Cloning

Three seconds of audio is all VALL-E needs to mimic a speaker it has never heard before. That specific claim, from Microsoft Research's 2023 paper, crystallised a shift: voice cloning had moved from a fine-tuning problem into an in-context learning problem. The distinction matters because it changes the architecture, the training data requirements, and the failure modes entirely.

What "zero-shot" actually means here

Classical multi-speaker TTS requires either re-training a model on the target speaker (speaker adaptation) or enrolling the speaker into a fixed embedding table learned during training. Both require the speaker to be known before inference.

Zero-shot voice cloning removes that constraint. At inference time the model receives:

A reference audio clip from an arbitrary, previously unseen speaker.
A text string to synthesise.

The model must produce speech that sounds like that speaker saying that text, with no weight updates, no fine-tuning, and no explicit speaker registration step. The speaker is "in-context", not "in-weights".

This is analogous to in-context prompting in language models: the model generalises from its training distribution to handle the new example at runtime. How well a system actually clones an unseen speaker, rather than merely producing a plausible-sounding impersonation, is measured empirically. Speaker similarity scores (typically cosine similarity between speaker embeddings of the reference and the generated audio) capture identity, and word error rate on the output captures intelligibility.
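
Both metrics are simple to compute once the embeddings and transcripts exist. The sketch below is a minimal illustration, not a reference evaluation pipeline: it assumes the speaker embeddings have already been extracted by some speaker verification model and that the generated audio has already been transcribed by an ASR system. The function names and the toy values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length speaker embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy values standing in for real model outputs (assumptions for illustration).
ref_embedding = [0.2, 0.9, -0.1, 0.4]    # embedding of the reference clip
gen_embedding = [0.25, 0.8, -0.05, 0.5]  # embedding of the generated speech
similarity = cosine_similarity(ref_embedding, gen_embedding)
wer = word_error_rate("the quick brown fox", "the quick brown box")
print(f"speaker similarity: {similarity:.3f}, WER: {wer:.2f}")
```

The two numbers pull in different directions in practice: a system can score high on similarity while garbling words, or read the text perfectly in the wrong voice, which is why both are reported.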

Two generations of architecture

Generation 1: Speaker-encoder conditioning

The approach described in Jia et al. (2018) from Google remains the conceptual baseline. Three independently trained components work together:

Component	Role
Speaker encoder	Maps a reference waveform to a fixed-dimensional embedding (the d-vector)
Acoustic model (Tacotron 2)	Generates mel spectrograms from text, conditioned on the d-vector
Vocoder (WaveNet)	Converts mel spectrograms to raw waveform

The speaker encoder is trained on a speaker verification objective (GE2E loss), not on TTS data. The insight is that a model trained to distinguish speakers already learns a compact, speaker-identity-preserving representation. This embedding is concatenated with the Tacotron 2 encoder output at every time step, so the attention-based decoder sees the speaker identity throughout synthesis.
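
The conditioning step itself is mechanically simple. The sketch below assumes the d-vector is broadcast across time and concatenated to each frame of the synthesiser's encoder output, as described in Jia et al. (2018). The function name, the toy dimensions, and the values are hypothetical, and plain Python lists stand in for tensors.

```python
def condition_on_speaker(encoder_frames, d_vector):
    """Append the same speaker embedding to every encoder output frame."""
    return [list(frame) + list(d_vector) for frame in encoder_frames]


# Toy shapes (assumptions): 3 encoder frames of width 4, 2-dimensional d-vector.
encoder_frames = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
d_vector = [0.6, -0.8]

conditioned = condition_on_speaker(encoder_frames, d_vector)
print(len(conditioned), len(conditioned[0]))  # prints "3 6"
```

Because the same vector is repeated on every frame, the speaker identity is a single global summary: nothing in this scheme lets the synthesiser look back at the reference audio itself.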

Training the three components separately is pragmatic but suboptimal: the speaker encoder does not know what information the synthesiser actually needs, and the mel-spectrogram intermediate representation is a lossy bridge. Naturalness suffers on speakers far from the training distribution.

Generation 2: Codec language models (VALL-E and descendants)

VALL-E (Wang et al., 2023) reformulates zero-shot TTS as a conditional language modelling problem over discrete audio tokens. The key shift:

Classical TTS:  text -> continuous mel spectrogram -> waveform
VALL-E:         text + ref_audio_tokens -> discrete codec tokens -> waveform

The discrete tokens come from EnCodec (Défossez et al., 2022), a neural audio codec that compresses audio into a residual vector quantisation (RVQ) hierarchy. In the configuration VALL-E uses, EnCodec produces 8 codebook levels per frame; VALL-E predicts the first level autoregressively and the remaining 7 with a non-autoregressive model that generates all time steps of a level in parallel.
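
Residual vector quantisation is easier to follow on a toy example. The sketch below is not EnCodec's implementation: it uses three tiny hand-written 2-D codebooks (hypothetical values), whereas a real codec learns much larger codebooks over high-dimensional latent frames. It only illustrates the principle that each level quantises whatever the previous levels left unexplained.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean)."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantise level by level; each level encodes the previous residual."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        tokens.append(idx)
        # Subtract the chosen entry so the next level sees only the error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected entry from each level given."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks, ordered coarse to fine.
codebooks = [
    [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]],
    [[0.5, 0.0], [-0.5, 0.0], [0.0, 0.5], [0.0, -0.5]],
    [[0.25, 0.25], [-0.25, 0.25], [0.25, -0.25], [-0.25, -0.25]],
]

frame = [0.8, 1.4]                      # one latent frame (toy value)
tokens = rvq_encode(frame, codebooks)   # one token per codebook level
print(tokens)                           # prints [0, 2, 3]
print(rvq_decode(tokens[:1], codebooks))  # coarse: prints [1.0, 1.0]
print(rvq_decode(tokens, codebooks))      # refined: prints [0.75, 1.25]
```

Decoding from the first token alone already lands near the input, and later levels only refine it. That coarse-to-fine structure is a plausible reading of why the first level is modelled autoregressively while the finer levels are filled in afterwards.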
