The Mel-Spectrogram Interface

Take any modern text-to-speech system built after 2017 and cut it in half. The left half reads text and produces a 2-D image of sound. The right half turns that image into audio samples. Almost universally, the image in the middle is a mel spectrogram. Understanding why that specific representation won, what it encodes, and where it breaks is the clearest path into the entire TTS stack.

What a spectrogram actually stores

A raw audio signal is a 1-D sequence of amplitude values, often sampled at 22,050 or 24,000 Hz. A 5-second clip at 22,050 Hz is therefore 110,250 numbers, with no obvious correspondence to the perceptual features a model needs to learn (pitch, formants, prosody).

The short-time Fourier transform (STFT) offers a first fix. We slide a window of length n_fft across the signal in hops of hop_length samples, apply a Hann window to reduce spectral leakage, and compute the DFT of each frame:

X[k, t] = sum_{n=0}^{N-1}  x[n + t*H] * w[n] * exp(-j 2π k n / N)

where N = n_fft, H = hop_length, and k indexes frequency bins. The result is a complex matrix; we usually take the magnitude squared to get a power spectrogram of shape (N/2 + 1, T).
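The framing, windowing and DFT steps above can be sketched in a few lines. This is a deliberately slow, standard-library-only illustration rather than a production STFT: the function names hann and power_spectrogram are made up for this example, the signal is not padded at the edges, and the tiny n_fft of 32 is chosen only so the loop runs instantly.

```python
import cmath
import math


def hann(n_fft):
    # Periodic Hann window, tapering each frame to reduce spectral leakage.
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def power_spectrogram(x, n_fft, hop_length):
    """Return |X[k, t]|^2 as a list of rows with shape (n_fft // 2 + 1, T)."""
    window = hann(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop_length  # assumption: no edge padding
    n_bins = n_fft // 2 + 1
    spec = [[0.0] * n_frames for _ in range(n_bins)]
    for t in range(n_frames):
        start = t * hop_length
        frame = [x[start + n] * window[n] for n in range(n_fft)]
        for k in range(n_bins):
            # Direct evaluation of the DFT sum for bin k of frame t.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            spec[k][t] = abs(acc) ** 2
    return spec


# A pure tone that sits exactly on bin 4 of a 32-point DFT.
n_fft, hop_length = 32, 8
tone = [math.sin(2 * math.pi * 4 * n / n_fft) for n in range(128)]
spec = power_spectrogram(tone, n_fft, hop_length)
peak_bin = max(range(len(spec)), key=lambda k: spec[k][0])
print(len(spec), len(spec[0]), peak_bin)  # prints: 17 13 4
```

The 17 rows are the N/2 + 1 frequency bins, the 13 columns are the frames, and the energy peaks at the bin matching the tone. A real pipeline would use an FFT routine instead of the explicit sum.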

That is still a linear frequency axis. The human ear does not perceive frequency linearly: a jump from 100 Hz to 200 Hz sounds much larger than a jump from 5000 Hz to 5100 Hz. The mel scale approximates the ear's response by compressing high frequencies:

mel(f) = 2595 * log10(1 + f / 700)
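As a quick check of the compression, the formula can be evaluated for the two 100 Hz jumps mentioned above. The helper names hz_to_mel and mel_to_hz are chosen for this sketch; the inverse is simply the formula solved for f.

```python
import math


def hz_to_mel(f_hz):
    # The mel formula from the text.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Algebraic inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(5100.0) - hz_to_mel(5000.0)
print(round(low_step, 1), round(high_step, 1))
```

The low-frequency jump comes out at roughly 133 mel and the high-frequency jump at roughly 20 mel, so the same 100 Hz step is several times smaller on the mel axis at 5 kHz than at 100 Hz.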

We apply a triangular filterbank of n_mels filters (typically 80) spaced evenly on the mel scale, multiply each filter against the linear power spectrum, sum within each filter, and take the log:

M[m, t] = log( sum_k  F_m[k] * |X[k, t]|^2 )

where F_m[k] is the weight of the m-th filter at frequency bin k. The output is the log-mel spectrogram: a matrix of shape (n_mels, T), usually (80, T). For a 5-second clip at 22,050 Hz with hop_length=256, T is about 430 frames. The representation is now roughly 80 x 430, or about 34,400 values: around a third of the 110,250 waveform samples, and structured in a way that mirrors human perception.
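One way to build the triangular filterbank and apply it to a single power-spectrum frame is sketched below. It is a minimal version under stated assumptions: filters run from 0 Hz to the Nyquist frequency, no area normalisation is applied (libraries often offer one), and the floor passed to the log is an arbitrary small constant. The names mel_filterbank and log_mel_frame are illustrative.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters: n_mels rows, each of length n_fft // 2 + 1."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge points, evenly spaced in mel, converted back to Hz.
    edges_hz = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, centre, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (centre - lo)
            falling = (hi - f) / (hi - centre)
            # Weight is the triangle height at f, zero outside [lo, hi].
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def log_mel_frame(power_frame, bank, floor=1e-10):
    # floor keeps log() finite for silent frames; its value is an assumption.
    return [
        math.log(max(floor, sum(w * p for w, p in zip(row, power_frame))))
        for row in bank
    ]


bank = mel_filterbank(n_mels=80, n_fft=1024, sample_rate=22050)
flat_power = [1.0] * 513  # a made-up frame with equal power in every bin
frame = log_mel_frame(flat_power, bank)
print(len(bank), len(bank[0]), len(frame))  # prints: 80 513 80
```

Applying log_mel_frame to every column of a power spectrogram yields the (n_mels, T) matrix described above.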

Why TTS systems converge on this representation

Before Tacotron (Wang et al., 2017), TTS pipelines had separate, hand-crafted modules: text analysis, duration modelling, acoustic parameter extraction (typically mel-cepstral coefficients or vocoder parameters), and a parametric vocoder. Each module introduced its own error floor, and the hand-crafted features were brittle across speakers and domains.

Tacotron 2 (Shen et al., 2017) made the central architectural bet: train a sequence-to-sequence model to predict mel spectrograms directly from character or phoneme sequences, then invert the spectrograms to waveforms with a separately trained neural vocoder (WaveNet, in their case). The benefits were immediate and concrete:

Acoustic models are comparatively cheap to train on mel spectrograms. An 80-bin mel frame is an 80-dimensional real vector, not a 513-bin complex spectrum (the N/2 + 1 bins of a 1024-point FFT). Attention mechanisms and convolutional stacks handle this scale easily.
The representation abstracts away the waveform synthesis problem. The acoustic model does not need to know whether the vocoder uses WaveNet, GAN, or a flow; it just needs to produce plausible mel frames. This clean separation allowed vocoders to be swapped and improved independently.
Log-mel features are perceptually centred. The log compression aligns dynamic range with human loudness perception. The mel filterbank aligns frequency resolution with pitch discrimination. A model learning on this space is being asked to match what a listener actually hears, not to minimise raw waveform MSE.
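The separation of concerns described above can be pictured as two small interfaces that meet at the mel matrix. The sketch below is purely illustrative: AcousticModel, Vocoder, the two toy classes and synthesize are hypothetical names, and the toy implementations emit constant values rather than anything resembling speech.

```python
from typing import List, Protocol

Mel = List[List[float]]  # shape (n_mels, T), as in the text


class AcousticModel(Protocol):
    def text_to_mel(self, text: str) -> Mel: ...


class Vocoder(Protocol):
    def mel_to_audio(self, mel: Mel) -> List[float]: ...


class ToyAcousticModel:
    """Stand-in: emits 5 constant frames per input character."""

    def __init__(self, n_mels: int = 80) -> None:
        self.n_mels = n_mels

    def text_to_mel(self, text: str) -> Mel:
        n_frames = 5 * len(text)
        return [[-4.0] * n_frames for _ in range(self.n_mels)]


class ToyVocoder:
    """Stand-in: emits hop_length zero-valued samples per mel frame."""

    def __init__(self, hop_length: int = 256) -> None:
        self.hop_length = hop_length

    def mel_to_audio(self, mel: Mel) -> List[float]:
        n_frames = len(mel[0]) if mel else 0
        return [0.0] * (self.hop_length * n_frames)


def synthesize(text: str, acoustic: AcousticModel, vocoder: Vocoder) -> List[float]:
    # The only thing the two halves share is the mel matrix.
    return vocoder.mel_to_audio(acoustic.text_to_mel(text))


audio = synthesize("hello", ToyAcousticModel(), ToyVocoder())
print(len(audio))  # 5 chars * 5 frames * 256 samples = 6400
```

Because synthesize depends only on the two method signatures, either half can be replaced without touching the other, which is the property that let vocoders improve independently.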

The combination of a cleaner learning target and the separation of concerns drove rapid progress. Almost every major architecture that followed (FastSpeech, VITS, YourTTS, NaturalSpeech) either produces mel spectrograms or uses them as training targets even when the full pipeline is end-to-end differentiable.

The inversion side: vocoders and their constraints

A mel spectrogram is not invertible in a strict mathematical sense. The mel filterbank is a many-to-one mapping from the linear spectrum, so inverting it requires making assumptions about the phase and about the distribution of energy within each filter's frequency band. This is the vocoder's job.
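A toy case makes the many-to-one property concrete. The three-bin filter and the three spectra below are invented for illustration; any real filter spans more bins, which only makes the ambiguity larger.

```python
# One triangular filter covering three linear-frequency bins (made-up weights).
filter_weights = [0.5, 1.0, 0.5]


def filter_energy(power_bins):
    # The weighted sum a single mel filter computes before the log.
    return sum(w * p for w, p in zip(filter_weights, power_bins))


spectrum_a = [2.0, 0.0, 0.0]  # all energy in the lowest bin
spectrum_b = [0.0, 1.0, 0.0]  # all energy in the centre bin
spectrum_c = [0.0, 0.0, 2.0]  # all energy in the highest bin

energies = [filter_energy(s) for s in (spectrum_a, spectrum_b, spectrum_c)]
print(energies)  # prints: [1.0, 1.0, 1.0]
```

Three different linear spectra produce the same filter output, so the mel value alone cannot say which one was present. The vocoder has to pick a plausible answer from what it learned in training.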

Four broad families exist:

Vocoder family	Example	Speed	Artefact character
Griffin-Lim (iterative phase estimation)	Griffin-Lim (1984)	Fast	Phasiness, metallic ring
Autoregressive neural	WaveNet	Very slow (CPU)	Near-natural, high cost
GAN-based	HiFi-GAN (Kong et al., 2020)	Real-time or faster	Occasional buzz on unseen speakers
Flow / diffusion	WaveGlow, DiffWave	Moderate	Good; heavy memory

HiFi-GAN is currently the de facto standard for offline synthesis: it operates directly on mel spectrograms, and Kong et al. report generation about 167 times faster than real time on a single GPU, with MOS scores comparable to or better than their WaveNet baseline.

The crucial point is that the vocoder's output quality is bounded by the mel spectrogram it receives. If the acoustic model produces blurry or inconsistent mel frames, no vocoder can recover the missing information. The mel spectrogram is, literally, the interface contract between the two halves of the pipeline.

Key hyperparameters and their effects

Three numbers dominate mel spectrogram quality for TTS:

n_fft (FFT window size). Larger windows give finer frequency resolution but coarser time resolution. Typical value: 1024 samples at 22,050 Hz (about 46 ms). At 44,100 Hz you would double this to keep the same window duration.
hop_length. Controls frame rate. 256 samples at 22,050 Hz gives ~11.6 ms per frame, which is fine enough for prosody but near the limit for stop consonants. Using 128 doubles temporal resolution at roughly double the sequence length and training cost.
n_mels. 80 is the community default, balancing expressivity against model size. 128-mel spectrograms are common in higher-fidelity systems; below 64 you start to lose fricative detail audibly.
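The trade-offs above are simple arithmetic, sketched here with illustrative helper names. The frame count assumes no padding, so exact values in a given library may differ by a frame or two.

```python
def frame_ms(hop_length, sample_rate):
    # Time advanced per mel frame, in milliseconds.
    return 1000.0 * hop_length / sample_rate


def bin_spacing_hz(n_fft, sample_rate):
    # Distance between adjacent linear-frequency bins.
    return sample_rate / n_fft


def n_frames(n_samples, hop_length):
    # Assumption: no padding, one frame per full hop.
    return n_samples // hop_length


sr = 22050
for hop in (256, 128):
    print(hop, round(frame_ms(hop, sr), 1), n_frames(5 * sr, hop))
# prints: 256 11.6 430
# prints: 128 5.8 861

print(round(bin_spacing_hz(1024, sr), 1), round(bin_spacing_hz(2048, 44100), 1))
# prints: 21.5 21.5
```

Halving hop_length halves the frame duration and doubles the sequence length for the same clip. Doubling n_fft at 44,100 Hz keeps the bin spacing the same as 1024 at 22,050 Hz.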

Mismatching these between training and inference is a silent failure: the model produces audio, but it sounds slightly wrong in ways that are hard to attribute without checking the config.
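One inexpensive guard is to store the feature settings next to the checkpoint and compare them before synthesis. The sketch below assumes a hypothetical MelConfig record with the fields discussed in this section; real projects will have more (window type, frequency limits, normalisation).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MelConfig:
    # Hypothetical config record; defaults follow the typical values in the text.
    sample_rate: int = 22050
    n_fft: int = 1024
    hop_length: int = 256
    n_mels: int = 80


def config_mismatches(train, infer):
    """Return {field: (train_value, infer_value)} for every differing field."""
    return {
        f.name: (getattr(train, f.name), getattr(infer, f.name))
        for f in fields(train)
        if getattr(train, f.name) != getattr(infer, f.name)
    }


def require_matching(train, infer):
    diff = config_mismatches(train, infer)
    if diff:
        raise ValueError(f"mel config mismatch (train, inference): {diff}")


train_cfg = MelConfig()
infer_cfg = MelConfig(hop_length=128)
print(config_mismatches(train_cfg, infer_cfg))  # prints: {'hop_length': (256, 128)}
```

Calling require_matching(train_cfg, infer_cfg) at load time turns the silent failure into an immediate error that names the offending field.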

When it falls down

High-frequency detail loss. The mel filterbank's filters widen with frequency, so fricative energy above ~8 kHz (/s/, /ʃ/, /f/) is merged into a few broad bins. For 16 kHz speech synthesis this is invisible, because nothing above the 8 kHz Nyquist limit is represented in the first place; for 44.1 kHz music synthesis it becomes audible.
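To see how much wider the top filters are, the band edges can be computed directly from the mel formula. This sketch assumes 80 filters spaced between 0 Hz and 11,025 Hz (the Nyquist frequency at 22,050 Hz); filter_span_hz is an illustrative helper, not a library function.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def filter_span_hz(m, n_mels, f_max):
    """Lower and upper edge in Hz of triangular filter m (0-based)."""
    top = hz_to_mel(f_max)
    lo = mel_to_hz(top * m / (n_mels + 1))
    hi = mel_to_hz(top * (m + 2) / (n_mels + 1))
    return lo, hi


first_lo, first_hi = filter_span_hz(0, 80, 11025.0)
last_lo, last_hi = filter_span_hz(79, 80, 11025.0)
first_width = first_hi - first_lo
last_width = last_hi - last_lo
print(round(first_width), round(last_width))
```

Under these assumptions the lowest filter spans about 50 Hz while the highest spans several hundred Hz, so fine spectral structure near the top of the band is averaged into a single number.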

Phase information is discarded. The log-mel representation throws away all phase. Phase errors are the dominant cause of the "phasiness" or "buzzy" quality in Griffin-Lim outputs. GAN vocoders learn implicit phase priors from data, but they can fail on out-of-distribution pitch ranges or speaking styles.

Blurry predictions under pointwise regression losses. If an acoustic model is trained with a simple L1 or L2 loss on mel frames (as early Tacotron variants were), it learns to predict an average over the plausible outputs for a given text: the conditional mean under L2, the conditional median under L1. The mel frames look slightly blurred, and the vocoder converts that blur into a muffled or over-smooth sound. Modern systems address this with adversarial losses, flow-based decoders, or diffusion, but it is a fundamental tension in the representation's use.
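The averaging effect shows up even in a toy case. Suppose the same input has two equally plausible mel frames, one with a spectral peak at bin 2 and one with the peak at bin 5. Both frames are invented for this illustration.

```python
# Two equally likely targets for the same input (made-up 8-bin frames).
target_a = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
target_b = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def expected_mse(pred):
    # Average loss when each target occurs half of the time.
    return 0.5 * (mse(pred, target_a) + mse(pred, target_b))


mean_pred = [(a + b) / 2.0 for a, b in zip(target_a, target_b)]

print(expected_mse(mean_pred))  # prints: 0.0625
print(expected_mse(target_a))   # prints: 0.125
print(max(mean_pred))           # prints: 0.5
```

The averaged frame has the lower expected loss, so an L2-trained model is pushed toward it, yet it contains two half-height peaks and matches neither plausible frame. That is the blur the vocoder later renders as muffled audio.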

Silence and noise boundary artefacts. Silence in log-mel space is not zero; it is a large negative number determined by the log of near-zero filter energy, usually clamped to a small floor so that the log stays finite. If a model does not learn the exact silence floor, vocoders produce low-level background crackle or fail to produce clean pauses.
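The size of the silence floor, and how sensitive it is to small prediction errors, follows from the log itself. The floor of 1e-5 below is an assumed value chosen for illustration; the actual constant depends on the feature extraction code in use.

```python
import math

FLOOR = 1e-5  # assumed clamp value; varies between implementations


def log_energy(energy, floor=FLOOR):
    # Clamp before the log so that digital silence maps to a finite number.
    return math.log(max(energy, floor))


silence = log_energy(0.0)
print(round(silence, 2))  # prints: -11.51

# A prediction that misses the floor by 2 log units...
predicted = silence + 2.0
# ...corresponds to this many times more energy than true silence.
energy_ratio = math.exp(predicted) / math.exp(silence)
print(round(energy_ratio, 1))  # prints: 7.4
```

An error that looks small on the log axis is a multiplicative error in energy, which is why slightly-off silence frames can surface as audible noise in pauses.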

Variable-length sequences and attention instability. Mel spectrograms grow linearly with utterance length. Sequence-to-sequence models with soft attention (as in Tacotron 2) occasionally "skip" or "repeat" frames for long sentences, producing cutoff words or stuttering. Non-attentive duration-explicit models (FastSpeech family) bypass this, but they require an explicit duration predictor or external aligner.
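The duration-explicit alternative can be sketched as a length regulator: each phoneme encoding is repeated for its predicted number of frames, so alignment is fixed by construction rather than learned by attention. The function name length_regulate, the one-dimensional encodings and the durations below are all invented for illustration.

```python
def length_regulate(phoneme_vectors, durations):
    """Repeat each phoneme vector durations[i] times to form frame-level input."""
    if len(phoneme_vectors) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for vec, n in zip(phoneme_vectors, durations):
        # Copy the vector so later edits to one frame do not affect the others.
        frames.extend(list(vec) for _ in range(n))
    return frames


# Four phonemes with toy 1-D encodings and made-up durations in frames.
encodings = [[0.0], [1.0], [2.0], [3.0]]
durations = [3, 5, 4, 8]
frames = length_regulate(encodings, durations)
print(len(frames))                  # prints: 20
print([f[0] for f in frames[:9]])   # prints: [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0]
```

Every phoneme appears exactly once and in order, so skips and repeats cannot occur. The cost is that the durations have to come from somewhere, which is the predictor or external aligner mentioned above.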
