Griffin-Lim and Classical Vocoders

Before neural vocoders, every TTS pipeline hit the same wall: you could predict a mel-spectrogram from text with reasonable accuracy, but a spectrogram discards phase information, and without phase a magnitude plot cannot be turned back into audible sound. The Griffin-Lim algorithm (1984) was the standard way around that wall for decades. Knowing how it works, why it sounds the way it does, and what finally superseded it is the fastest path to understanding why WaveNet and its descendants matter.

The phase problem

A Short-Time Fourier Transform (STFT) converts a waveform into a sequence of complex-valued frames. Each bin of each frame has a magnitude and a phase. Acoustic models such as Tacotron predict only the magnitude (or a mel-scaled compression of it) because phase looks noise-like and is hard to model directly. Recovering a waveform from magnitude alone is therefore an ill-posed inverse problem: the target magnitudes can be paired with infinitely many phase assignments, and the vocoder has to choose one that yields a plausible waveform.
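The loss of phase can be seen on a single analysis frame. The following is a minimal sketch using NumPy only; the random toy frame, its length of 512 samples, and the 37-sample delay are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)           # one analysis frame of a toy signal
shifted = np.roll(frame, 37)               # same samples, circularly delayed

spec = np.fft.rfft(frame)
magnitude, phase = np.abs(spec), np.angle(spec)

# A circular delay changes only the phase of the DFT, not its magnitude,
# so two different frames can share one magnitude spectrum.
same_magnitude = np.allclose(magnitude, np.abs(np.fft.rfft(shifted)))

# Magnitude plus the true phase inverts exactly; magnitude alone does not.
with_phase = np.fft.irfft(magnitude * np.exp(1j * phase), n=512)
zero_phase = np.fft.irfft(magnitude, n=512)

print("shifted frame has the same magnitude:", same_magnitude)
print("true phase recovers the frame:", np.allclose(with_phase, frame))
print("zero phase recovers the frame:", np.allclose(zero_phase, frame))
```

The delayed frame and the original are different waveforms with identical magnitudes, which is the single-frame version of the ambiguity a vocoder has to resolve.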

The key mathematical constraint that makes recovery possible is the consistency property: because neighbouring frames overlap, any spectrogram S that is the true STFT of some signal must satisfy a set of linear constraints, which can be summarised as S = STFT(iSTFT(S)). A target magnitude paired with arbitrary (for example random) phase almost certainly violates those constraints. Griffin-Lim exploits this.
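Consistency can be measured directly by passing a spectrogram through an inverse and a forward STFT and checking how much it changes. The sketch below is NumPy-only and rests on several assumptions: the helper names `stft`, `istft` and `inconsistency` are hypothetical, the frame size (512), hop (128), periodic Hann window and 16 kHz two-tone test signal are arbitrary, and the inverse uses windowed overlap-add normalised by the summed squared window.

```python
import numpy as np

N_FFT, HOP = 512, 128                      # assumed frame size and hop
WINDOW = np.hanning(N_FFT + 1)[:-1]        # periodic Hann window

def stft(x):
    """Frame, window and FFT a signal; returns (n_frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] for i in range(n_frames)])
    return np.fft.rfft(frames * WINDOW, axis=1)

def istft(spec):
    """Least-squares inverse: windowed overlap-add, normalised by window energy."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    length = N_FFT + HOP * (len(spec) - 1)
    signal, energy = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        signal[i * HOP:i * HOP + N_FFT] += frame
        energy[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    return signal / np.where(energy > 1e-12, energy, 1.0)

def inconsistency(spec):
    """Relative distance between a spectrogram and STFT(iSTFT(spectrogram))."""
    return np.linalg.norm(spec - stft(istft(spec))) / np.linalg.norm(spec)

rng = np.random.default_rng(0)
t = np.arange(N_FFT + HOP * 60) / 16000.0
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

true_spec = stft(x)                                        # consistent by construction
random_phase = rng.uniform(-np.pi, np.pi, true_spec.shape)
fake_spec = np.abs(true_spec) * np.exp(1j * random_phase)  # same magnitude, random phase

print(f"true STFT:    {inconsistency(true_spec):.2e}")
print(f"random phase: {inconsistency(fake_spec):.2e}")
```

The true STFT survives the round trip essentially unchanged, while the random-phase spectrogram is altered substantially even though its magnitudes are identical.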

How Griffin-Lim works

The algorithm alternates two projections:

Project onto valid signals. Given a complex spectrogram estimate S, compute the inverse STFT to get a waveform x, then recompute the STFT of x to get S'. This step forces S' to be consistent (it is the true STFT of a real signal).
Project onto the magnitude constraint. Replace the magnitudes in S' with the target magnitudes M, keeping the phases from S'. This yields the next iterate.

Initialise: random phase P_0, target magnitude M
for t = 1, 2, ..., N_iter:
    S_t = M * exp(j * P_{t-1})        # reassemble with current phase
    x_t = iSTFT(S_t)                  # invert to waveform
    S_t' = STFT(x_t)                  # recompute full complex STFT
    P_t = angle(S_t')                 # extract updated phase estimate
return iSTFT(M * exp(j * P_t))

Each iteration is guaranteed not to increase the squared error between the STFT magnitude of the current waveform estimate and the target magnitude M. The algorithm therefore converges, though not necessarily to the global optimum. In practice, returns diminish beyond roughly 32 to 60 iterations; most implementations default to 32.
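The loop and its non-increasing error can be sketched in runnable form. This is a NumPy-only illustration under stated assumptions, not a reference implementation: `stft`, `istft` and `griffin_lim` are hypothetical helper names, the frame size, hop, window and two-tone test signal are arbitrary, and the inverse STFT is the least-squares overlap-add form. Libraries such as librosa and torchaudio ship their own tuned versions.

```python
import numpy as np

N_FFT, HOP = 512, 128                      # assumed frame size and hop
WINDOW = np.hanning(N_FFT + 1)[:-1]        # periodic Hann window
# Each interior rfft bin stands for two bins of the full two-sided spectrum.
BIN_WEIGHT = np.r_[1.0, np.full(N_FFT // 2 - 1, 2.0), 1.0]

def stft(x):
    """Frame, window and FFT a signal; returns (n_frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] for i in range(n_frames)])
    return np.fft.rfft(frames * WINDOW, axis=1)

def istft(spec):
    """Least-squares inverse: windowed overlap-add, normalised by window energy."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    length = N_FFT + HOP * (len(spec) - 1)
    signal, energy = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        signal[i * HOP:i * HOP + N_FFT] += frame
        energy[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    return signal / np.where(energy > 1e-12, energy, 1.0)

def griffin_lim(target_mag, n_iter=32, seed=0):
    """Return (waveform, errors); errors[i] is the magnitude mismatch after pass i."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, target_mag.shape)   # random initial phase
    errors = []
    for _ in range(n_iter):
        x = istft(target_mag * np.exp(1j * phase))   # project onto valid signals
        spec = stft(x)                               # consistent spectrogram of x
        phase = np.angle(spec)                       # keep phase, reimpose magnitude next pass
        errors.append(np.sum(BIN_WEIGHT * (np.abs(spec) - target_mag) ** 2))
    return istft(target_mag * np.exp(1j * phase)), np.array(errors)

t = np.arange(N_FFT + HOP * 60) / 16000.0
clean = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
waveform, errors = griffin_lim(np.abs(stft(clean)), n_iter=32)

print(f"error after first pass: {errors[0]:.3e}")
print(f"error after last pass:  {errors[-1]:.3e}")
```

Plotting `errors` against the iteration index typically shows a steep initial drop followed by a long flat tail, which is the diminishing-returns behaviour described above.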

Tacotron (2017) shipped Griffin-Lim as its vocoder. The original paper used 50 iterations and a 12.5 ms frame shift. On a CPU, 50 iterations over a few seconds of audio runs in a second or two, which was acceptable for offline demos.

Classical vocoders before and alongside Griffin-Lim

Griffin-Lim is one member of a broader family of classical signal-processing vocoders. The term "vocoder" originally referred to a voice encoder - a system for compressing speech by parameterising it as a source-filter model:

A source (glottal excitation for voiced speech, noise for unvoiced)
A filter (the vocal tract, modelled as a sequence of LPC or LSP coefficients)

Vocoder type	Parametric representation	Synthesis method
LPC vocoder	Predictor coefficients + pitch	AR filter driven by excitation
WORLD	F0 + spectral envelope + aperiodicity	Minimum-phase + mixed excitation
STRAIGHT	F0 + smoothed spectral envelope + aperiodicity	Frequency-domain overlap-add
Griffin-Lim	Magnitude STFT	Iterative phase estimation
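The source-filter split behind the LPC row can be sketched in a few lines. This is a toy illustration under loud assumptions: the sample rate, pitch, single 700 Hz resonance and pole radius are invented values, the helper `all_pole` is hypothetical, and a real LPC vocoder would use a higher-order filter with coefficients estimated frame by frame from speech.

```python
import numpy as np

FS = 16000                                 # assumed sample rate in Hz
F0 = 100                                   # assumed pitch of the voiced source in Hz
FORMANT_HZ, POLE_RADIUS = 700.0, 0.97      # one hypothetical vocal-tract resonance

# All-pole filter 1 / (1 + a1 z^-1 + a2 z^-2) built from a conjugate pole pair.
theta = 2 * np.pi * FORMANT_HZ / FS
a1, a2 = -2 * POLE_RADIUS * np.cos(theta), POLE_RADIUS ** 2

def all_pole(excitation):
    """Run an excitation through the two-pole AR filter, sample by sample."""
    out = np.zeros(len(excitation))
    y1 = y2 = 0.0                          # previous two output samples
    for n, e in enumerate(excitation):
        out[n] = e - a1 * y1 - a2 * y2
        y1, y2 = out[n], y1
    return out

voiced_source = np.zeros(FS)               # one second of excitation
voiced_source[::FS // F0] = 1.0            # impulse train standing in for glottal pulses
unvoiced_source = np.random.default_rng(0).standard_normal(FS)   # white noise

voiced = all_pole(voiced_source)           # buzzy, vowel-like tone
unvoiced = all_pole(unvoiced_source)       # noise coloured by the same resonance

spectrum = np.abs(np.fft.rfft(voiced))
peak_hz = np.fft.rfftfreq(FS, d=1 / FS)[np.argmax(spectrum)]
print(f"strongest harmonic of the voiced output: {peak_hz:.0f} Hz")
```

The source sets the pitch and the voiced or unvoiced character, while the filter decides which harmonics are emphasised; the strongest harmonic lands next to the resonance regardless of F0.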

WORLD (Morise et al.) became the dominant classical vocoder in research TTS through the mid-2010s. It models three separate streams - fundamental frequency, spectral envelope, and aperiodicity - and synthesises speech by combining a periodic excitation with a noise component in proportions set by the aperiodicity stream. WORLD runs in roughly real time and produces cleaner reconstructions than raw Griffin-Lim because it has a genuine parametric model of the speech source, not just a phase-recovery trick.

STRAIGHT predates WORLD and achieves high perceptual quality through instantaneous-frequency-based F0 estimation and pitch-adaptive smoothing of the spectral envelope, which suppresses the periodic interference and "musical noise" artefacts common in simpler spectral processing. Its computational cost made it impractical for fast synthesis.

Why classical vocoders sound the way they do

The characteristic artefacts are not random - they follow directly from the algorithms:

Buzziness / roboticness (LPC, WORLD): LPC models the vocal tract as an all-pole (AR) filter driven by a flat-spectrum excitation: a periodic impulse train for voiced speech, white noise for unvoiced. Natural speech has irregular glottal pulses and micro-variation in F0, so a perfectly periodic excitation sounds unnatural.
Musical noise (Griffin-Lim): Random phase initialisation and the magnitude constraint together produce a kind of tonal shimmer. Spectral bins at different frames interfere in structured ways, creating faint pitched artefacts that move as the spectrogram changes.
Muffled high frequencies (most classical vocoders): Spectral envelope estimation smooths over fine detail. High-frequency content above ~6 kHz is particularly hard to model; it often gets attenuated or distorted.
Loss of breathiness and naturalness (WORLD/STRAIGHT): Aperiodicity estimation is an approximation. When a speaker's voice has fine-grained breathiness or vocal fry, the three-stream model cannot capture it faithfully.

These artefacts were precisely what neural vocoders set out to eliminate: WaveNet (2016) replaced the whole synthesis stage with an autoregressive waveform model; MelGAN (2019) showed that a GAN-based generator could do the same job 100x faster than real time on a GPU.

When it falls down

High iteration counts do not fix the fundamental problem. Beyond about 100 iterations, Griffin-Lim quality plateaus. The algorithm converges to a local minimum, and consistency only guarantees that the result is a valid waveform with roughly the target magnitudes, not that its phase sounds natural. Perceptual quality is bounded by that constraint, not by iteration count.

Spectral smearing compounds the error. When the acoustic model produces a mel-spectrogram that is itself inaccurate (wrong pitch, blurred formants), Griffin-Lim has no way to recover the correct waveform. It faithfully reconstructs a waveform consistent with the wrong spectrogram. Neural vocoders trained end-to-end are somewhat more robust to spectrogram errors because they learn a distribution over plausible waveforms.

Silence and fricatives are rendered particularly poorly. In low-energy regions of the spectrogram the magnitudes constrain the phase only weakly, so Griffin-Lim renders them as hiss. Fricatives (s, f, sh) require realistic noise with correct spectral shaping; the algorithm provides neither.

Prosody transfer and voice conversion fail gracefully but audibly. If you shift the pitch of a mel-spectrogram before inversion, Griffin-Lim has no model of what "a valid waveform at that pitch" looks like. The result is often intelligible but unnatural.

Cost scales with audio length and inversely with hop size. Each iteration requires a forward and an inverse STFT. For 24 kHz audio with a 256-sample hop, a 10-second clip needs ~940 frames; at two transforms per frame, 50 iterations means roughly 94,000 FFT operations. Still faster than WaveNet autoregressive sampling (which operates sample by sample), but slower than single-pass neural models like HiFi-GAN.
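The frame and FFT counts follow from simple arithmetic, reproduced here as a small sketch. The variable names are illustrative, and the count assumes one inverse and one forward transform per frame per iteration, with no padding frames.

```python
import math

sample_rate = 24_000      # Hz
hop = 256                 # samples between successive frames
duration_s = 10           # clip length in seconds
n_iter = 50               # Griffin-Lim iterations

n_frames = math.ceil(sample_rate * duration_s / hop)
ffts_per_iteration = 2 * n_frames          # one inverse and one forward transform per frame
total_ffts = n_iter * ffts_per_iteration

print(n_frames, total_ffts)                # 938 93800
```

Halving the hop doubles the frame count and therefore the cost, which is why hop size matters as much as clip length.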

Not trained end-to-end in practice. Griffin-Lim's iterations are built from differentiable operations, so gradients can in principle be propagated through an unrolled loop, but this is rarely done. The vocoder is typically treated as a post-processing step, decoupled from acoustic model training. This means vocoder errors do not inform the acoustic model.
