WaveNet
WaveNet is a fully autoregressive convolutional neural network that models raw audio waveforms one sample at a time, achieving near-human speech quality at the cost of extremely slow sequential generation.
Before WaveNet appeared in 2016, production text-to-speech systems either stitched together recorded speech fragments (concatenative synthesis) or drove a signal-processing vocoder from a statistical acoustic model (parametric synthesis). The result was intelligible but never convincing - listeners could always detect the synthetic shimmer. DeepMind's WaveNet paper (van den Oord et al., 2016) bypassed the vocoder chain entirely and asked a simpler question: can a deep neural network learn a probability distribution directly over raw audio samples? The answer was yes, and its mean opinion score came closer to that of human recordings than any previous system's.
What WaveNet models
Audio is a time series of amplitude samples. At 16 kHz (wideband speech, the rate used in the original paper; narrowband telephony runs at 8 kHz), one second of speech is 16,000 samples. At 24 kHz, a rate common in later TTS systems, it is 24,000. WaveNet models the joint probability of an audio sequence as a product of conditionals:
p(x) = product over t of p(x_t | x_1, ..., x_{t-1})
Each sample is predicted from all samples that came before it. That is the autoregressive property. The model outputs a categorical distribution over 256 amplitude levels, obtained by mu-law companding the 16-bit signal (a logarithmic quantisation that spends more levels on quiet amplitudes, where hearing is most sensitive), so training reduces to standard cross-entropy.
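The companding step can be sketched in a few lines. This is a minimal illustration, assuming amplitudes normalised to [-1, 1] and mu = 255 (which gives 256 classes); the function names are made up for this example and are not taken from the paper.

```python
import math

MU = 255  # 256 quantisation levels correspond to mu = 255


def mu_law_encode(x: float, mu: int = MU) -> int:
    """Map an amplitude in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    # Logarithmic companding: fine resolution near zero, coarse near +/-1.
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Rescale from [-1, 1] to [0, mu] and round to the nearest class.
    return int(round((y + 1.0) / 2.0 * mu))


def mu_law_decode(k: int, mu: int = MU) -> float:
    """Invert the companding: map a class index back to an amplitude."""
    y = 2.0 * k / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


for amplitude in (-1.0, -0.1, 0.01, 0.5, 1.0):
    k = mu_law_encode(amplitude)
    print(f"{amplitude:+.2f} -> class {k:3d} -> {mu_law_decode(k):+.4f}")
```

The integer class is what the softmax predicts; the decoded amplitude is what gets played back. Quiet samples survive the round trip almost unchanged, while loud ones are reconstructed more coarsely.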
Conditioning on text happens through a local conditioning signal - typically a sequence of linguistic or phonetic features aligned to audio frames - which is upsampled to the audio sample rate and added inside every layer's gated activation. A global conditioning signal (for example a speaker embedding vector) handles multi-speaker models by shifting the network's behaviour for the whole utterance.
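Upsampling is needed because linguistic features arrive far less often than audio samples. The simplest scheme is to repeat each frame's feature vector for every sample it covers; a learned upsampling layer is the usual alternative. The sketch below shows the repetition variant only, and the 5 ms frame shift and the two-dimensional feature vectors are assumptions chosen for illustration.

```python
def upsample_by_repetition(frames, samples_per_frame):
    """Repeat each frame-level feature vector so there is one per audio sample."""
    return [frame for frame in frames for _ in range(samples_per_frame)]


SAMPLE_RATE = 16_000  # Hz
FRAME_MS = 5          # assumed frame shift of the linguistic features
samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000  # 80 samples per frame

# Three hypothetical frames of (pitch-like, voicing-like) features.
frames = [[0.1, 1.0], [0.4, 0.0], [0.9, 1.0]]
per_sample = upsample_by_repetition(frames, samples_per_frame)

print(len(per_sample))  # 240
```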
Dilated causal convolutions
The interesting architecture question is: how do you give the network a long receptive field without blowing up the parameter count?
WaveNet uses dilated causal convolutions. A causal convolution masks future samples from the input, preserving the autoregressive constraint. Dilation multiplies the spacing between the filter taps:
Dilation 1: x[t] x[t-1] x[t-2] x[t-3] (standard)
Dilation 2: x[t] x[t-2] x[t-4] x[t-6]
Dilation 4: x[t] x[t-4] x[t-8] x[t-12]
Dilation 8: x[t] x[t-8] x[t-16] x[t-24]
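The tap pattern above can be written out directly. The following is a plain-Python sketch of a single-channel causal dilated convolution, assuming zero padding on the left; real implementations use a tensor library and many channels. WaveNet's own layers use two taps per filter, which is the case exercised here.

```python
def causal_dilated_conv(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Indices before the start of the signal are treated as zero, so y[t]
    never depends on any input later than x[t].
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


signal = [1, 2, 3, 4, 5, 6, 7, 8]
# Two taps with dilation 2: each output is x[t] + x[t-2].
print(causal_dilated_conv(signal, [1.0, 1.0], dilation=2))
# [1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
```

Changing a later input never alters an earlier output, which is the property that keeps the model autoregressive.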
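Stacking layers with doubling dilations (1, 2, 4, ..., 512) makes the receptive field grow exponentially with depth within a stack, and repeating the stack adds further context and capacity. The paper's illustrative configuration has 30 layers: three repetitions of a 1-2-4-8-16-32-64-128-256-512 stack. With two-tap filters each repetition spans 1,024 samples, so the three together cover about 3,070 samples, or roughly 190 ms at 16 kHz. The paper's text-to-speech models report a receptive field of 240 ms - enough to span a few phonemes and capture coarticulation across their boundaries, while longer-range prosody has to come from the conditioning features.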
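The receptive field of such a stack is easy to compute by hand: every layer with kernel size k and dilation d adds (k - 1) * d samples of context. The sketch below assumes two-tap filters, as in the paper's figures. Under that assumption the 30-layer schedule comes to about 192 ms, somewhat below the 240 ms quoted for the paper's TTS models, so the exact configuration of those models presumably differed.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


stack = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512

one_stack = receptive_field(stack)         # 1024 samples
three_stacks = receptive_field(stack * 3)  # 3070 samples

print(one_stack, three_stacks)
print(f"{1000 * three_stacks / 16_000:.1f} ms at 16 kHz")  # 191.9 ms
```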
Each layer applies a gated activation function borrowed from PixelCNN:
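z = tanh(W_f * x + V_f * h) ⊙ sigmoid(W_g * x + V_g * h)
where x is the layer input, h is the conditioning signal, W_f and W_g are the learned filter and gate convolution kernels, V_f and V_g project the conditioning signal, * denotes convolution, and ⊙ denotes element-wise multiplication. The sigmoid gate lets the network selectively suppress or pass features at each layer.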
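A scalar sketch of the gate may help. It assumes the two pre-activations (the convolution output plus the conditioning term, for the filter and gate branches respectively) have already been computed; the product between the tanh and sigmoid branches is element-wise. The function names are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gated_activation(filter_pre, gate_pre):
    """Element-wise tanh(filter) * sigmoid(gate) over two equal-length lists."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_pre, gate_pre)]


filter_pre = [2.0, 2.0, -1.0]
gate_pre = [-20.0, 20.0, 0.0]  # gate closed, gate open, gate half open

print([round(z, 4) for z in gated_activation(filter_pre, gate_pre)])
```

With the gate pre-activation strongly negative the feature is suppressed to almost zero; with it strongly positive the output is essentially tanh of the filter branch.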
Residual connections carry the signal from the input of each layer to its output, and skip connections aggregate contributions from all layers before the final 1x1 conv + softmax head. This mirrors the skip-connection trick that made ResNets trainable at depth.
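The wiring of residual and skip paths can be sketched with single-channel layers. This is a simplification under stated assumptions: two-tap causal filters, scalar weights standing in for the 1x1 convolutions on the residual and skip paths, and no conditioning term. The names `residual_block` and `run_stack` are hypothetical.

```python
import math


def causal_conv2(x, w, dilation):
    """Two-tap causal convolution: w[0] * x[t] + w[1] * x[t - dilation]."""
    return [
        w[0] * x[t] + (w[1] * x[t - dilation] if t >= dilation else 0.0)
        for t in range(len(x))
    ]


def residual_block(x, w_filter, w_gate, dilation, w_res=1.0, w_skip=1.0):
    f = causal_conv2(x, w_filter, dilation)
    g = causal_conv2(x, w_gate, dilation)
    # Gated activation: tanh(f) * sigmoid(g), element-wise.
    z = [math.tanh(a) / (1.0 + math.exp(-b)) for a, b in zip(f, g)]
    residual = [xi + w_res * zi for xi, zi in zip(x, z)]  # feeds the next layer
    skip = [w_skip * zi for zi in z]                      # feeds the output head
    return residual, skip


def run_stack(x, layers):
    """Apply the blocks in order and sum their skip outputs."""
    skip_sum = [0.0] * len(x)
    for w_filter, w_gate, dilation in layers:
        x, skip = residual_block(x, w_filter, w_gate, dilation)
        skip_sum = [s + k for s, k in zip(skip_sum, skip)]
    return x, skip_sum


layers = [((0.5, -0.3), (0.2, 0.1), 1), ((0.4, 0.4), (-0.1, 0.3), 2)]
out, skips = run_stack([0.5, -0.2, 0.1, 0.3], layers)
print(out, skips)
```

Only the summed skip signal goes to the output head, so every layer contributes to the prediction directly instead of through all the layers above it.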
From raw waveform to speech quality
What made the mean opinion score jump was not one trick but the combination:
| Design choice | Effect |
|---|---|
| Raw waveform modelling | No vocoder artefacts (buzziness, ringing) |
| Dilated causal convolutions | Long context without recurrence |
| mu-law quantisation (256 bins) | Perceptually uniform resolution |
| Gated activations | Non-linear feature selection |
| Multi-speaker conditioning | One model, many voices |
In the 2016 paper, WaveNet scored 4.21 MOS for US English versus 4.55 for natural speech and 3.86 for the best concatenative baseline. Mandarin results were similar. The gap to human quality was smaller than any previous system had achieved, and critically, the failure modes were different: it could produce slightly odd prosody, but not the robotic, metallic timbre that marked the prior generation.
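To put the US English numbers in proportion, the short calculation below uses only the three scores quoted above and works out how much of the remaining gap to natural speech WaveNet closed relative to the concatenative baseline.

```python
mos = {"natural": 4.55, "wavenet": 4.21, "concatenative": 3.86}

gap_before = mos["natural"] - mos["concatenative"]  # baseline's distance to natural speech
gap_after = mos["natural"] - mos["wavenet"]         # WaveNet's distance to natural speech
reduction = 1.0 - gap_after / gap_before

print(f"{gap_before:.2f} {gap_after:.2f} {reduction:.0%}")  # 0.69 0.34 51%
```

On these figures the gap roughly halved.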
When it falls down
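Generation speed is the original sin. Because each sample is conditioned on the samples generated before it, inference is strictly sequential: one full network evaluation per sample, 16,000 of them for each second of 16 kHz audio. The original implementation was far slower than real time; DeepMind later described its production model as about 1,000 times faster than the original, which puts the original at roughly 0.02 seconds of audio per second of compute. This ruled out on-device or low-latency applications entirely.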
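The sequential bottleneck is visible in the shape of the sampling loop itself. In the sketch below, `toy_next_distribution` is a made-up stand-in for a trained network (it merely favours repeating the previous class); the point is only that sample t cannot be drawn until samples 0 to t-1 exist.

```python
import random


def toy_next_distribution(history, num_classes=256):
    """Hypothetical stand-in for the network: a distribution over the next class."""
    last = history[-1] if history else num_classes // 2
    weights = [1.0] * num_classes
    weights[last] += num_classes  # bias towards repeating the previous sample
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, next_distribution, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):  # one model evaluation per output sample
        probs = next_distribution(samples)
        samples.append(rng.choices(range(len(probs)), weights=probs)[0])
    return samples


audio_classes = generate(160, toy_next_distribution)  # 10 ms at 16 kHz
print(len(audio_classes))  # 160
```

Training does not have this problem: the ground-truth waveform is known in advance, so all conditionals can be evaluated in one parallel pass.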
Parallel WaveNet (van den Oord et al., 2017) addressed this with probability density distillation: a fast inverse autoregressive flow student network is trained to match the teacher WaveNet's distribution. The student generates all samples in parallel and runs more than 20 times faster than real time, but it requires a pre-trained WaveNet as the teacher, which adds training complexity.
Long-range semantic control is weak. WaveNet conditions on pre-computed linguistic features, not raw text. That means it depends heavily on the quality of the text-analysis front end: a bad grapheme-to-phoneme step or incorrect stress prediction degrades output regardless of the acoustic model's quality.
Data hunger. High-quality single-speaker WaveNet models require tens of hours of clean, consistently recorded speech. Noise, room reverb, or mic variation in training data degrades fidelity noticeably.
Quantisation artefacts from the 8-bit output. The 256-bin mu-law discretisation is sufficient for clean speech, but for musical content or expressive speech with wide dynamic range it introduces subtle staircasing, because the quantisation steps are coarse at high amplitudes. A mixture of logistics (as in PixelCNN++) or flow-based continuous output distributions partially address this.
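The size of the steps can be checked numerically. The sketch below assumes mu = 255 and amplitudes normalised to [-1, 1], and compares the spacing between adjacent mu-law levels near silence and near full scale with the step of 16-bit linear PCM.

```python
import math


def mu_law_decode(k, mu=255):
    """Map a class index in [0, mu] back to an amplitude in [-1, 1]."""
    y = 2.0 * k / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


step_near_zero = mu_law_decode(128) - mu_law_decode(127)
step_near_full_scale = mu_law_decode(255) - mu_law_decode(254)
step_16bit_linear = 1.0 / 32768

print(f"near zero:       {step_near_zero:.6f}")
print(f"near full scale: {step_near_full_scale:.6f}")
print(f"16-bit linear:   {step_16bit_linear:.6f}")
```

Near silence the mu-law levels are only a few times coarser than 16-bit PCM, but near full scale adjacent levels sit about 0.04 apart, which is where loud or highly dynamic material suffers.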
Superseded for production use. Subsequent vocoder architectures - WaveGlow, HiFi-GAN, and codec-based models - achieve comparable or better quality at a fraction of the inference cost. WaveNet's architectural ideas (dilated causal convolutions, gated activations) live on inside these successors, but vanilla WaveNet is rarely trained from scratch today.