Diffusion Models for Speech

WaveNet synthesised speech so convincingly that listeners scored it above 4.0 MOS in 2016, but generating one second of audio required thousands of sequential autoregressive steps. Every improvement in quality came with a corresponding tax on latency. Diffusion models broke that coupling: DiffWave (ICLR 2021) reached MOS 4.44 on a vocoding task while generating audio in parallel, matching a strong WaveNet baseline at orders-of-magnitude higher throughput.

What diffusion actually does

A diffusion model defines two Markov chains. The forward chain gradually corrupts a clean data sample x_0 by adding Gaussian noise across T steps, arriving at x_T which is approximately standard normal. The reverse chain learns to undo that corruption step by step, recovering structure from noise.

For a fixed noise schedule beta_1, ..., beta_T the forward marginal has a closed form:

q(x_t | x_0) = N(x_t ; sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)

where alpha_bar_t = product of (1 - beta_s) for s = 1..t. This lets you sample any noisy version of the data directly without running the chain forward step by step.
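The closed form above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular paper's code: the linear schedule endpoints, T = 1000, the sine-wave stand-in for a waveform, and the helper names make_alpha_bar and q_sample are all assumptions made for the example.

```python
import numpy as np

def make_alpha_bar(betas):
    """Cumulative product of (1 - beta_s) for s = 1..t, one entry per step."""
    return np.cumprod(1.0 - np.asarray(betas, dtype=np.float64))

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 1-based step index."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

# Assumed linear schedule, chosen only for illustration.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = make_alpha_bar(betas)

rng = np.random.default_rng(0)
# Stand-in "waveform": one second of a 220 Hz sine sampled at 16 kHz.
x0 = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)

x_mid, _ = q_sample(x0, T // 2, alpha_bar, rng)  # partly noised
x_end, _ = q_sample(x0, T, alpha_bar, rng)       # close to pure noise
```

Because alpha_bar_t shrinks towards zero as t grows, x_end carries almost none of the original sine and behaves like a standard normal draw, while x_mid is still a recognisable mix of signal and noise.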

The model is trained to predict either the added noise epsilon or the clean sample x_0 from (x_t, t, conditioning). At inference, you start from x_T ~ N(0, I) and run the learned reverse transitions, gradually refining the signal.
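The training objective and the reverse loop just described can be sketched as follows. Several things here are assumptions for the sake of a runnable example: the noise-prediction variant is used, the reverse variance is set to beta_t (one common choice), conditioning is omitted, and the trained network is replaced by a hypothetical oracle_eps that is only valid because the toy "dataset" is a single known signal.

```python
import numpy as np

def ddpm_loss(eps_model, x0, t, alpha_bar, rng):
    """Noise-prediction objective: MSE between true and predicted epsilon."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: start from x_T ~ N(0, I), apply T reverse steps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in range(len(betas), 0, -1):
        i = t - 1
        eps_hat = eps_model(x, t)
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        # No fresh noise is added on the final step (t = 1).
        noise = rng.standard_normal(shape) if t > 1 else 0.0
        x = mean + np.sqrt(betas[i]) * noise
    return x

# Toy setup: the data distribution is one known signal, so the ideal
# noise predictor has a closed form. A real system would train a network.
betas = np.linspace(1e-4, 0.02, 1000)  # assumed schedule
alpha_bar = np.cumprod(1.0 - betas)
target = np.sin(2 * np.pi * 220 * np.arange(1600) / 16000)

def oracle_eps(x_t, t):
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * target) / np.sqrt(1.0 - a)

rng = np.random.default_rng(0)
loss = ddpm_loss(oracle_eps, target, 500, alpha_bar, rng)
sample = ddpm_sample(oracle_eps, target.shape, betas, rng)
```

With a perfect predictor the loss is zero up to floating-point error and the sampler lands back on the target signal. In a real vocoder eps_model would also receive the mel-spectrogram as conditioning.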

For speech there are two places to apply this process: the raw waveform at 16 kHz or 24 kHz, or a compressed intermediate such as a mel-spectrogram.

Waveform-domain models

WaveGrad (Chen et al., 2020) and DiffWave (Kong et al., 2020) both operate directly on 1-D waveforms conditioned on a mel-spectrogram. They differ in architecture (WaveGrad uses a FiLM-conditioned U-Net; DiffWave uses a non-autoregressive, WaveNet-style stack of dilated convolutions) but share the same core insight: each reverse diffusion step updates every audio sample in parallel, whereas an autoregressive model must generate each sample after the previous one. The reverse steps themselves still run in sequence, so the cost scales with the number of diffusion steps rather than with the length of the audio.

The number of diffusion steps T is a hyperparameter with real engineering consequences:

T at training	T at inference	Relative quality	Relative speed
1000	1000	Highest	Slowest
1000	6	Slightly reduced	~50x faster
1000	3	Noticeable drop	~150x faster

WaveGrad showed that six reverse steps often suffice for production-quality vocoding, provided the inference noise schedule is chosen carefully. This mismatch between the training and inference schedules (train long, infer short) is central to practical deployment.
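One way to picture "train long, infer short" is a sampler whose model is conditioned on a continuous noise level rather than an integer step index, so any schedule can be plugged in at inference. The sketch below rests on assumptions: the six-value schedule is hypothetical rather than taken from WaveGrad, the reverse variance is set to beta, and oracle_eps is a closed-form stand-in for a trained network on a single known signal.

```python
import numpy as np

def sample_with_schedule(eps_model, shape, betas, rng):
    """Reverse sampling with an arbitrary inference schedule of any length."""
    betas = np.asarray(betas, dtype=np.float64)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for i in range(len(betas) - 1, -1, -1):
        level = np.sqrt(alpha_bar[i])  # continuous noise level fed to the model
        eps_hat = eps_model(x, level)
        x = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        if i > 0:  # no fresh noise on the final step
            x = x + np.sqrt(betas[i]) * rng.standard_normal(shape)
    return x

target = np.sin(2 * np.pi * 220 * np.arange(1600) / 16000)
calls = {"n": 0}  # counts network evaluations

def oracle_eps(x, level):
    """Hypothetical perfect predictor for a dataset of one known signal."""
    calls["n"] += 1
    return (x - level * target) / np.sqrt(1.0 - level ** 2)

# Hypothetical six-step schedule; real systems search for these values.
short_betas = np.geomspace(1e-4, 0.9, 6)
rng = np.random.default_rng(0)
audio = sample_with_schedule(oracle_eps, target.shape, short_betas, rng)
```

The loop body is unchanged from a long-schedule sampler; only the length and values of betas differ, which is why the cost per utterance drops with the number of model calls. With a real network the choice of those six values decides how much quality survives.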

Spectrogram-domain models and full TTS pipelines

Grad-TTS (Popov et al., 2021) applies score-based diffusion to mel-spectrograms rather than waveforms. Instead of starting the reverse process from pure Gaussian noise, it starts from noise centred on a text-dependent mean: a text encoder output that a learned duration/alignment model stretches to the length of the target mel-spectrogram. This gives the reverse chain a much more informative starting point.
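A minimal sketch of that text-dependent starting point, under stated assumptions: the token means and durations below are random or hand-picked placeholders for what an encoder and duration model would produce, and the helper names and the temperature knob are illustrative rather than Grad-TTS's actual interface.

```python
import numpy as np

def expand_by_duration(token_means, durations):
    """Repeat each token's mean vector for its predicted number of frames."""
    return np.repeat(token_means, durations, axis=0)

def sample_text_prior(mu, rng, temperature=1.0):
    """Draw a starting point from N(mu, temperature^2 * I)."""
    return mu + temperature * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
n_mels = 80
token_means = rng.standard_normal((4, n_mels))  # hypothetical encoder output, 4 tokens
durations = np.array([3, 5, 2, 6])              # hypothetical frames per token

mu = expand_by_duration(token_means, durations)  # frame-level mean, shape (16, 80)
x_start = sample_text_prior(mu, rng)             # where the reverse process begins
```

Each row of mu is the mean for one mel frame, so the starting point already has the right length and a coarse text-dependent shape before any reverse step runs.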
