Zero-Shot Voice Cloning

Zero-shot voice cloning synthesises speech in the voice of an unseen speaker from a short reference recording, without any fine-tuning at inference time.

Three seconds of audio is all VALL-E needs to mimic a speaker it has never heard before. That specific claim, from Microsoft Research's 2023 paper, crystallised a shift: voice cloning had moved from a fine-tuning problem into an in-context learning problem. The distinction matters because it changes the architecture, the training data requirements, and the failure modes entirely.

What "zero-shot" actually means here

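Classical multi-speaker TTS handles a target speaker in one of two ways: fine-tuning the model on recordings of that speaker (speaker adaptation), or giving the speaker an entry in a fixed embedding table learned during training. Either way, the model must be trained or updated with that speaker's data before it can synthesise their voice, so a speaker who first appears at inference time has no representation to condition on.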
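The fixed-table limitation can be illustrated with a toy sketch. Everything here is hypothetical: the speaker names, the embedding size, and the `encode_reference` helper are illustrative assumptions, not the design of VALL-E or any real TTS system. The chunk-averaging "encoder" is only a stand-in for whatever learned mechanism a zero-shot model uses to condition on reference audio.

```python
import random

# Hypothetical toy setup: names and sizes are illustrative only.
EMBED_DIM = 4
TRAINING_SPEAKERS = ["spk_a", "spk_b", "spk_c"]

random.seed(0)
# Fixed embedding table: one vector per speaker seen during training.
speaker_table = {
    name: [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for name in TRAINING_SPEAKERS
}


def lookup_speaker(name):
    """Classical conditioning: the speaker must already have a table row."""
    if name not in speaker_table:
        raise KeyError(f"speaker {name!r} was not enrolled during training")
    return speaker_table[name]


def encode_reference(samples, dim=EMBED_DIM):
    """Stand-in for zero-shot conditioning: build a vector from audio alone.

    Assumption: a real system would use a learned encoder or prompt
    mechanism; averaging equal-sized chunks is purely for illustration.
    """
    chunk = max(1, len(samples) // dim)
    vec = []
    for i in range(dim):
        part = samples[i * chunk:(i + 1) * chunk] or [0.0]
        vec.append(sum(part) / len(part))
    return vec


if __name__ == "__main__":
    print(len(lookup_speaker("spk_a")))      # known speaker: table lookup works
    try:
        lookup_speaker("new_speaker")        # unseen speaker: no row exists
    except KeyError as err:
        print("lookup failed:", err)
    reference_audio = [0.5] * 16             # fake reference samples
    print(len(encode_reference(reference_audio)))  # conditioning from audio only
```

The table lookup fails for any identity outside the training set, whereas the reference-based path needs no identity at all, only a recording. That difference is the sense in which the speaker no longer has to be known in advance.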
