Zero-Shot Voice Cloning
Zero-shot voice cloning synthesises speech in the voice of an unseen speaker from a short reference recording, without any fine-tuning at inference time.
Three seconds of audio is all VALL-E needs to mimic a speaker it has never heard before. That specific claim, from Microsoft Research's 2023 paper, crystallised a shift: voice cloning had moved from a fine-tuning problem into an in-context learning problem. The distinction matters because it changes the architecture, the training data requirements, and the failure modes entirely.
What "zero-shot" actually means here
Classical multi-speaker TTS requires either fine-tuning the model on the target speaker (speaker adaptation) or giving the speaker an entry in a fixed embedding table learned during training. Either way, the speaker must be known before inference.
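To make the closed-set limitation concrete, here is a minimal sketch contrasting a fixed speaker table with conditioning computed from the reference audio itself. Everything in it is an illustrative assumption rather than any published system's method: the names `SPEAKER_TABLE`, `lookup_embedding`, and `encode_reference` are hypothetical, the 16 kHz sample rate is assumed, and the chunked-energy "encoder" is only a stand-in for a learned speaker encoder or an acoustic-token prompt.

```python
import math

EMBED_DIM = 4  # toy size, chosen only for illustration

# Closed-set conditioning: one learned vector per training speaker.
# The values are made up; a real table would be learned during training.
SPEAKER_TABLE = {
    "speaker_a": [0.1, 0.3, -0.2, 0.5],
    "speaker_b": [-0.4, 0.2, 0.6, 0.0],
}


def lookup_embedding(speaker_id):
    """Return the table row for a speaker seen during training."""
    if speaker_id not in SPEAKER_TABLE:
        # An unseen speaker has no row, so there is nothing to condition on.
        raise KeyError(f"unknown speaker {speaker_id!r}: not in the training set")
    return SPEAKER_TABLE[speaker_id]


def encode_reference(samples, dim=EMBED_DIM):
    """Derive a conditioning vector directly from reference audio.

    Hypothetical stand-in for a learned encoder: split the waveform
    into `dim` equal chunks and take the RMS energy of each chunk.
    """
    if len(samples) < dim:
        raise ValueError("reference recording is too short")
    chunk = len(samples) // dim
    return [
        math.sqrt(sum(x * x for x in samples[i * chunk:(i + 1) * chunk]) / chunk)
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Three seconds of synthetic audio at an assumed 16 kHz sample rate.
    reference = [math.sin(0.01 * n) for n in range(3 * 16000)]

    try:
        lookup_embedding("new_speaker")
    except KeyError as err:
        print("table lookup failed:", err)

    # The open-set path needs only the audio, not a speaker identity.
    print("vector from reference audio:", encode_reference(reference))
```

The point of the contrast is the function signatures: the table is keyed by speaker identity, so it cannot serve a speaker it has no entry for, while the reference-based path takes only audio and therefore works for any speaker at inference time.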