Diffusion Models for Speech
Diffusion models iteratively denoise random Gaussian noise into speech waveforms or mel-spectrograms, achieving sample quality that matches autoregressive vocoders at a fraction of the sequential compute cost.
WaveNet synthesised speech so convincingly that listeners scored it above 4.0 MOS in 2016, but it generated audio one sample at a time, so one second of 16 kHz audio required 16,000 sequential autoregressive steps. Every improvement in quality came with a corresponding tax on latency. Diffusion models broke that coupling: DiffWave (ICLR 2021) reached MOS 4.44 on a vocoding task while producing all waveform samples in parallel at each denoising step, matching a strong WaveNet baseline at orders-of-magnitude higher throughput.
What diffusion actually does
A diffusion model defines two Markov chains. The forward chain gradually corrupts a clean data sample x_0 by adding a small amount of Gaussian noise at each of T steps, arriving at x_T, which is approximately distributed as a standard normal. The reverse chain learns to undo that corruption step by step, recovering structure from noise.
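To make the forward chain concrete, here is a minimal NumPy sketch. It relies on the standard closed form for Gaussian diffusion, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta_s) up to step t. The linear schedule, T = 1000, the toy sine "waveform", and the helper names linear_beta_schedule and forward_diffuse are all illustrative assumptions, not settings taken from any particular speech model.

```python
import numpy as np


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (illustrative values)."""
    return np.linspace(beta_start, beta_end, T)


def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t directly from q(x_t | x_0) without looping over steps.

    Returns the noised sample and the noise that was added.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal power left at each step

# Toy stand-in for clean speech: one second of a 440 Hz sine at 16 kHz.
sample_rate = 16000
x0 = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)

rng = np.random.default_rng(0)
x_mid, _ = forward_diffuse(x0, T // 2, alpha_bar, rng)  # partly corrupted
x_T, _ = forward_diffuse(x0, T - 1, alpha_bar, rng)     # almost pure noise

# How much of the original signal survives at the midpoint and at the end.
corr_mid = np.corrcoef(x0, x_mid)[0, 1]
corr_T = np.corrcoef(x0, x_T)[0, 1]
```

Under these assumptions the final alpha_bar value is close to zero, so x_T carries almost no trace of x0 and its samples look like draws from a standard normal. A trained reverse chain would start from exactly that kind of noise and walk back toward a waveform.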