VALL-E: TTS as Token Language Modelling
VALL-E reformulates text-to-speech as a conditional language modelling problem over discrete audio codec tokens, enabling zero-shot voice cloning from a three-second recording by treating acoustic context the same way GPT treats a few-shot text prompt.
Three seconds of audio. That is all VALL-E needs to clone a voice it has never heard, producing speech that preserves not just the speaker's timbre but also the recording's room acoustics and emotional colouring. The trick is not a better vocoder or a fancier acoustic model. It is a reframing: treat speech synthesis as token prediction over discrete audio codes, much as a large language model treats text.
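To make the reframing concrete, here is a minimal sketch of the idea, not VALL-E's actual model. It assumes phonemes and audio are already integer token ids, and it uses a hypothetical stand-in function, `toy_logits`, in place of a trained decoder. The vocabulary size and the end-of-speech id are illustrative assumptions. The point is only the shape of the loop: the text and the acoustic prompt form a prefix, and new audio tokens are sampled one at a time.

```python
import math
import random

CODEBOOK_SIZE = 1024      # assumed size of the codec token vocabulary
EOS = CODEBOOK_SIZE       # extra id reserved in this sketch for "end of speech"


def toy_logits(phonemes, acoustic_prefix):
    """Hypothetical stand-in for a trained decoder.

    Returns one score per possible next token (codec tokens plus EOS),
    derived deterministically from the conditioning so the sketch is repeatable.
    """
    rng = random.Random(hash((tuple(phonemes), tuple(acoustic_prefix))))
    return [rng.gauss(0.0, 1.0) for _ in range(CODEBOOK_SIZE + 1)]


def sample(logits, rng):
    """Draw one token id from softmax(logits)."""
    peak = max(logits)
    weights = [math.exp(score - peak) for score in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(phonemes, prompt_tokens, max_new_tokens=50, seed=0):
    """Continue the acoustic prompt token by token, conditioned on the phonemes."""
    rng = random.Random(seed)
    generated = []
    for _ in range(max_new_tokens):
        # The prompt plays the role of a few-shot prefix: the stand-in model
        # sees the phonemes, the prompt tokens, and everything generated so far.
        logits = toy_logits(phonemes, list(prompt_tokens) + generated)
        token = sample(logits, rng)
        if token == EOS:
            break
        generated.append(token)
    return generated


if __name__ == "__main__":
    phonemes = [3, 1, 4]          # made-up phoneme ids for the target text
    prompt = [10, 20, 30]         # made-up codec tokens from the enrolment clip
    print(generate(phonemes, prompt, max_new_tokens=10))
```

In a real system the generated ids would be passed to the codec's decoder to recover a waveform; that step is omitted here.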
The Old Pipeline and Why It Bottlenecks
Classical TTS pipelines decompose synthesis into separable stages. A text front-end normalises and phonemises the input. An acoustic model (Tacotron 2, FastSpeech 2, etc.) maps phoneme sequences to mel-spectrograms. A vocoder (HiFi-GAN, WaveNet) converts spectrograms to waveforms. The acoustic model and vocoder are trained on clean studio recordings from a single speaker or a handful of speakers, typically tens of hours per voice and at most a few hundred hours in total.
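The staged design can be sketched as three functions composed in sequence. The code below is a toy illustration of the interfaces only: every stage is a trivial stand-in, and the constants `N_MELS` and `HOP` as well as the fixed five frames per phoneme are assumptions chosen for readability, not values taken from any of the systems named above.

```python
from typing import List

N_MELS = 80   # assumed number of mel bands per spectrogram frame
HOP = 256     # assumed number of waveform samples produced per frame


def text_front_end(text: str) -> List[str]:
    """Toy normaliser and phonemiser: lowercase, keep letters as pseudo-phonemes."""
    return [ch for ch in text.lower() if ch.isalpha()]


def acoustic_model(phonemes: List[str], frames_per_phoneme: int = 5) -> List[List[float]]:
    """Stand-in for an acoustic model: emits a fixed number of mel frames per phoneme."""
    mel = []
    for phoneme in phonemes:
        value = (ord(phoneme) - ord("a")) / 25.0   # arbitrary per-phoneme value
        mel.extend([[value] * N_MELS for _ in range(frames_per_phoneme)])
    return mel


def vocoder(mel: List[List[float]]) -> List[float]:
    """Stand-in for a neural vocoder: expands each mel frame into HOP samples."""
    wave = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wave.extend([level] * HOP)
    return wave


def synthesize(text: str) -> List[float]:
    """Run the three stages in order: text -> phonemes -> mel frames -> waveform."""
    return vocoder(acoustic_model(text_front_end(text)))


if __name__ == "__main__":
    waveform = synthesize("Hi!")
    print(len(waveform))   # 2 phonemes * 5 frames * 256 samples
```

Each stage only sees the previous stage's output, which is why the stages can be developed and trained separately.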