VALL-E: TTS as Token Language Modelling

VALL-E reformulates text-to-speech as a conditional language modelling problem over discrete audio codec tokens, enabling zero-shot voice cloning from a three-second recording by treating acoustic context the same way GPT treats a few-shot text prompt.

Three seconds of audio. That is all VALL-E needs to clone a voice it has never heard, producing speech that preserves not just the speaker's timbre but also their room acoustics and emotional colouring. The trick is not a better vocoder or a fancier acoustic model. It is a reframing: treat speech synthesis as next-token prediction, exactly as a large language model treats text.
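To make the reframing concrete, here is a minimal sketch of speech generation as next-token prediction over codec tokens. Everything in it is an illustrative assumption: `toy_model` is a hypothetical stand-in for a trained decoder, `CODEBOOK_SIZE` is an arbitrary vocabulary size, and greedy decoding is used only to keep the output deterministic. It shows the shape of the idea, not the published system.

```python
from typing import Callable, List, Sequence

CODEBOOK_SIZE = 1024  # assumed codec vocabulary size, for illustration only

# A "model" is any function from (phonemes, audio tokens so far) to one score per codec token.
ScoreFn = Callable[[Sequence[int], Sequence[int]], List[float]]


def toy_model(phonemes: Sequence[int], audio_tokens: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for a trained decoder: it favours (last token + 1)."""
    scores = [0.0] * CODEBOOK_SIZE
    last = audio_tokens[-1] if audio_tokens else 0
    scores[(last + 1) % CODEBOOK_SIZE] = 1.0
    return scores


def continue_speech(
    phonemes: Sequence[int],
    prompt_tokens: Sequence[int],
    model: ScoreFn,
    n_new: int,
) -> List[int]:
    """Extend the acoustic prompt one codec token at a time.

    The enrolment recording (already encoded as codec tokens) plays the role
    that a few-shot prompt plays for a text language model.
    """
    audio = list(prompt_tokens)
    for _ in range(n_new):
        scores = model(phonemes, audio)
        # Greedy choice for determinism; a real system would typically sample.
        next_token = max(range(len(scores)), key=scores.__getitem__)
        audio.append(next_token)
    # Return only the newly generated tokens, not the prompt.
    return audio[len(prompt_tokens):]


phonemes = [12, 7, 33, 5]        # made-up phoneme ids for the target text
prompt_tokens = [100, 101, 102]  # made-up codec tokens from the short recording
new_tokens = continue_speech(phonemes, prompt_tokens, toy_model, n_new=4)
print(new_tokens)  # [103, 104, 105, 106]
```

The generated tokens would then be passed to the codec's decoder to produce a waveform; that step is omitted here.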

The Old Pipeline and Why It Bottlenecks

Classical TTS pipelines decompose synthesis into separable stages. A text front-end normalises and phonemises the input. An acoustic model (Tacotron 2, FastSpeech 2, etc.) maps phoneme sequences to mel-spectrograms. A vocoder (HiFi-GAN, WaveNet) converts spectrograms to waveforms. The learned stages are trained on clean studio recordings from a handful of speakers, typically tens of hours for a single voice or a few hundred hours across a multi-speaker corpus.
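The data flow through these stages can be sketched with toy stand-ins. None of the functions below is a real library API: `text_front_end`, `acoustic_model`, and `vocoder` are hypothetical placeholders, and the constants are arbitrary. The point is only that each stage consumes the previous stage's output and nothing else.

```python
from typing import List

N_MELS = 3      # assumed number of mel bins (toy value)
HOP_LENGTH = 4  # assumed waveform samples per spectrogram frame (toy value)


def text_front_end(text: str) -> List[str]:
    """Stage 1 stand-in: normalise text and emit pseudo-phonemes (here, letters)."""
    return [ch for ch in text.lower() if ch.isalpha()]


def acoustic_model(phonemes: List[str]) -> List[List[float]]:
    """Stage 2 stand-in: one fake mel frame per phoneme.

    A real acoustic model emits many frames per phoneme and predicts durations.
    """
    return [[float(ord(p) % 7)] * N_MELS for p in phonemes]


def vocoder(mel: List[List[float]]) -> List[float]:
    """Stage 3 stand-in: expand each frame into HOP_LENGTH waveform samples."""
    wave: List[float] = []
    for frame in mel:
        wave.extend([sum(frame) / len(frame)] * HOP_LENGTH)
    return wave


def synthesize(text: str) -> List[float]:
    """Chain the three stages; each one sees only the previous stage's output."""
    return vocoder(acoustic_model(text_front_end(text)))


phonemes = text_front_end("Hi, there!")
mel = acoustic_model(phonemes)
wave = vocoder(mel)
print(len(phonemes), len(mel), len(wave))  # 7 7 28
```

Because the vocoder only ever sees the spectrogram, any speaker or environment detail that the intermediate representation fails to carry cannot be recovered downstream.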
