The Acoustic Model and Vocoder Split

Modern neural TTS splits the problem into two specialised sub-networks - an acoustic model that maps text to a compact spectral representation, and a vocoder that reconstructs a full audio waveform from that representation.

intermediate · 8 min read

The problem with predicting 24,000 numbers per second

A single second of audio at 24 kHz, a common sample rate for neural TTS, is a sequence of 24,000 samples. Asking one network to map a sentence of perhaps 60 phonemes directly onto that sequence is technically possible, but it creates a brutal mismatch: the input changes on the timescale of phonemes (tens of milliseconds), while the output must be faithful at the level of individual cycles of a 4 kHz spectral component (250 microseconds). The two timescales differ by roughly two orders of magnitude.

The standard engineering response is the same one used everywhere in signal processing: decouple timescales with an intermediate representation. For speech, that representation is the mel spectrogram - a 2-D matrix of short-time Fourier energy projected onto a perceptually motivated frequency axis, computed every 10-12.5 ms. A one-second utterance becomes roughly 80-100 frames, each a vector of 80 mel bins. That is a far more tractable target for sequence modelling than raw audio.
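To make the frame arithmetic concrete, here is a minimal NumPy sketch of a log-mel spectrogram. The parameter values (24 kHz sample rate, 1024-point FFT, 240-sample hop for 10 ms frames, 80 mel bins) and the mel-scale formula are assumptions chosen for illustration; production systems typically rely on a library implementation and may differ in windowing, padding, and normalisation.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; other variants exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are equally spaced on the mel axis
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_spectrogram(wave, sr=24000, n_fft=1024, hop=240, n_mels=80):
    # Slice the waveform into overlapping windowed frames (no padding)
    n_frames = 1 + (len(wave) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum: the phase of each frame is discarded here
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-6).T  # shape (n_mels, n_frames)

sr = 24000
wave = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)  # one second of a 220 Hz tone
mel = mel_spectrogram(wave)
print(mel.shape)  # (80, 96): 24,000 samples reduced to 96 frames of 80 bins
```

Without padding, one second yields 96 frames rather than exactly 100, because the final partial windows are dropped.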

The result is the canonical two-stage pipeline:

text  →  [acoustic model]  →  mel spectrogram  →  [vocoder]  →  waveform

Each stage can be trained, swapped, and improved independently. That modularity is not a cosmetic convenience; it is the reason the field moved as fast as it did between 2017 and 2022.

What the acoustic model actually does

The acoustic model is responsible for everything linguistic: pronunciation, prosody, rhythm, stress, and duration. Its input is a sequence of phonemes (or characters, subword tokens, or a mixture), and its output is a sequence of mel spectrogram frames.

Tacotron 2 (Shen et al., ICASSP 2018) established the design pattern most subsequent systems follow: an encoder that computes context-rich embeddings of the input symbols (characters in the original paper, phonemes in many later systems), an attention mechanism that decides which part of the input each output frame attends to, and an autoregressive decoder that produces one mel frame at a time conditioned on all previous frames. The attention alignment is the core of the model - it must learn that the phoneme /a/ lasts roughly 70-100 ms, that stressed syllables are louder and longer, and that a question ends with rising pitch.

Non-autoregressive acoustic models like FastSpeech (Ren et al., NeurIPS 2019) replaced the learned encoder-decoder attention alignment with an explicit duration predictor: each phoneme is assigned a count of frames, the phoneme hidden states are expanded by replication, and a feed-forward Transformer then refines the frames in parallel. The authors report mel-spectrogram generation up to 270x faster than their autoregressive baseline, along with near-elimination of attention-failure artefacts (skipped words, babbling loops). The price is some of the naturalness that autoregressive attention provides.
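The expansion-by-replication step can be sketched in a few lines of NumPy. The function name `length_regulate`, the hidden size of 4, and the example durations are hypothetical; in a real system the durations come from a trained predictor and the expanded sequence is passed on to the decoder.

```python
import numpy as np

def length_regulate(phoneme_states, durations):
    """Repeat each phoneme's hidden vector durations[i] times along the time axis."""
    durations = np.asarray(durations, dtype=int)
    return np.repeat(phoneme_states, durations, axis=0)

# Three phonemes, each represented by a 4-dimensional hidden vector
phoneme_states = np.arange(3 * 4, dtype=float).reshape(3, 4)
durations = [2, 5, 3]  # frames per phoneme (assumed values for illustration)

frames = length_regulate(phoneme_states, durations)
print(frames.shape)  # (10, 4): one row per output frame
```

Because the output length is fixed by the sum of the durations before decoding starts, every frame can be computed in parallel, and no phoneme can be skipped or repeated indefinitely.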

Neither variant produces audio. They produce an 80-dimensional time series. Crucially, this representation discards the fine-grained phase information of the original waveform - a choice that is deliberate and consequential.

Why phase is discarded, and why that matters

A mel spectrogram retains magnitude information but not phase. Phase recovery from magnitude alone (via classical algorithms like Griffin-Lim) introduces audible artefacts - metallic, buzzy tones - because there is no unique phase consistent with a given magnitude pattern.

The vocoder's job is to synthesise a plausible, perceptually clean waveform consistent with the mel magnitudes. It does not recover the original phase; it invents a new one. This is why two different vocoders run on the same mel spectrogram produce audio that sounds similar but is not numerically identical.
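A small NumPy experiment illustrates why magnitude alone does not pin down a waveform. It is a simplified sketch: it works on the spectrum of one whole signal rather than on overlapping short-time frames, and the test signal and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
t = np.arange(n)
x = np.sin(2 * np.pi * 5 * t / n) + 0.5 * np.sin(2 * np.pi * 40 * t / n)

spectrum = np.fft.rfft(x)

# Rotate every bin by a random phase. The DC and Nyquist bins must stay
# real for a real-valued signal, so their phase is left unchanged.
phase_shift = rng.uniform(-np.pi, np.pi, size=spectrum.shape)
phase_shift[0] = 0.0
phase_shift[-1] = 0.0
y = np.fft.irfft(spectrum * np.exp(1j * phase_shift), n=n)

same_magnitude = np.allclose(np.abs(np.fft.rfft(y)), np.abs(spectrum), atol=1e-8)
max_sample_diff = np.max(np.abs(x - y))
print(same_magnitude, max_sample_diff > 1e-3)
```

The two signals have matching magnitude spectra but differ sample by sample, which is the same freedom a vocoder exploits when it picks a phase of its own.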

Early neural vocoders were autoregressive. WaveNet (van den Oord et al., 2016) models the waveform as:

p(x | mel) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1}, mel)

Each sample is predicted conditioned on all previous samples and the mel conditioning. This gives very high audio quality, but sequential generation at 24 kHz is slow. WaveNet conditioned on mel spectrograms was the vocoder used in the original Tacotron 2 pipeline.
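The cost of that factorisation is easiest to see as a loop. The sketch below is a toy under loud assumptions: `toy_next_sample` is a hypothetical stand-in for a network forward pass and has nothing to do with WaveNet's actual architecture, and the hop of 240 samples assumes 10 ms frames at 24 kHz.

```python
import numpy as np

def toy_next_sample(history, mel_frame, rng):
    # Hypothetical placeholder for one model forward pass (NOT a real vocoder)
    prev = history[-1] if len(history) else 0.0
    return 0.9 * prev + 0.1 * np.tanh(mel_frame.mean()) + 0.01 * rng.standard_normal()

def generate_autoregressive(mel, hop, next_sample, seed=0):
    rng = np.random.default_rng(seed)
    n_samples = mel.shape[1] * hop
    x = np.zeros(n_samples)
    steps = 0
    for t in range(n_samples):
        frame = mel[:, t // hop]                # conditioning frame covering sample t
        x[t] = next_sample(x[:t], frame, rng)   # depends on every earlier sample
        steps += 1
    return x, steps

mel = np.ones((80, 10))  # 10 frames, i.e. 100 ms of audio
x, steps = generate_autoregressive(mel, hop=240, next_sample=toy_next_sample)
print(steps)  # 2400 sequential steps for 100 ms of audio
```

Each iteration needs the previous iteration's result, so the loop cannot be parallelised across time; at 24 kHz that is 24,000 dependent model evaluations per second of speech.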

Flow-based vocoders like WaveGlow (Prenger et al., 2019) train an invertible mapping between Gaussian noise and audio, conditioned on the mel spectrogram; generation is a single parallel pass through the inverse network, which makes real-time synthesis on a GPU feasible. GAN-based vocoders like HiFi-GAN (Kong et al., NeurIPS 2020) pair a generator with multiple discriminators: multi-period discriminators that inspect the waveform at different periodic strides, and multi-scale discriminators that inspect it at different resolutions. The authors report 22.05 kHz audio generated 167x faster than real time on a V100 GPU, with MOS scores close to those of recorded human speech.

The vocoder sees no text. It knows nothing about what was said. It only knows the mel magnitudes, frame by frame.

The split enables independent improvement

The two-stage architecture allows a clean substitution interface: any acoustic model that outputs mel spectrograms can be paired with any vocoder that accepts them. This has concrete practical benefits.
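One way to picture the substitution interface is as a pair of structural types. Everything in this sketch is hypothetical: the `AcousticModel` and `Vocoder` protocols, the dummy implementations, and the constants are illustrations, not any library's API. In practice the two stages must also agree on sample rate, hop size, mel-bin count, and normalisation before they can be swapped freely.

```python
from typing import Protocol, Sequence
import numpy as np

N_MELS = 80   # assumed mel-bin count shared by both stages
HOP = 240     # assumed samples per mel frame (10 ms at 24 kHz)

class AcousticModel(Protocol):
    def __call__(self, phonemes: Sequence[str]) -> np.ndarray: ...

class Vocoder(Protocol):
    def __call__(self, mel: np.ndarray) -> np.ndarray: ...

class DummyAcousticModel:
    """Placeholder: emits 8 silent frames per phoneme."""
    def __call__(self, phonemes):
        return np.zeros((N_MELS, 8 * len(phonemes)))

class DummyVocoder:
    """Placeholder: emits HOP silent samples per mel frame."""
    def __call__(self, mel):
        if mel.shape[0] != N_MELS:
            raise ValueError("mel-bin count does not match the vocoder")
        return np.zeros(mel.shape[1] * HOP)

def synthesise(phonemes, acoustic_model: AcousticModel, vocoder: Vocoder):
    mel = acoustic_model(phonemes)   # text -> mel spectrogram
    return vocoder(mel)              # mel spectrogram -> waveform

wave = synthesise(["HH", "AH", "L", "OW"], DummyAcousticModel(), DummyVocoder())
print(len(wave))  # 4 phonemes * 8 frames * 240 samples = 7680
```

Any object with a matching call signature can be dropped into either slot, which is the property the modular pipeline depends on.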

Voice cloning typically fine-tunes only the acoustic model with a small speaker-specific dataset (a few minutes of speech), while the vocoder remains fixed. The reasoning is that voice identity is largely captured in the acoustic model's output - pitch contours, formant positions, speaking rate - whereas the vocoder only needs to invert the spectrogram faithfully. A well-trained universal vocoder generalises across speakers.

Latency optimisation can be applied to each stage separately. A streaming acoustic model can generate mel frames incrementally; a streaming vocoder can consume them and produce audio chunks before the full spectrogram is available. Pipelining the two stages gives a much lower time-to-first-audio than running each stage to completion in turn.
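The pipelining idea maps naturally onto Python generators. This is a sketch with placeholder stages: `stream_mel` and `stream_audio` are hypothetical names that emit zeros, and the chunk size of 20 frames is an arbitrary assumption. A real streaming vocoder would also need to carry context across chunk boundaries to avoid audible seams.

```python
import numpy as np

N_MELS, HOP, SR = 80, 240, 24000  # assumed: 80 bins, 10 ms hop at 24 kHz

def stream_mel(total_frames, chunk_frames):
    # Placeholder acoustic model: yields the spectrogram a few frames at a time
    for start in range(0, total_frames, chunk_frames):
        n = min(chunk_frames, total_frames - start)
        yield np.zeros((N_MELS, n))

def stream_audio(mel_chunks):
    # Placeholder vocoder: converts each mel chunk as soon as it arrives
    for mel in mel_chunks:
        yield np.zeros(mel.shape[1] * HOP)

audio_stream = stream_audio(stream_mel(total_frames=250, chunk_frames=20))

first_chunk = next(audio_stream)   # available after only 20 frames were produced
rest = list(audio_stream)          # remaining chunks arrive as the stages run

first_chunk_seconds = len(first_chunk) / SR
total_seconds = (len(first_chunk) + sum(len(c) for c in rest)) / SR
print(first_chunk_seconds, total_seconds)  # 0.2 s of audio ready first, 2.5 s in total
```

Playback can begin once the first 0.2 s chunk exists, instead of waiting for all 250 frames and the full waveform.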

Evaluation also decomposes cleanly. A degraded mel spectrogram (blurry, jittered, missing frames) and a degraded vocoder produce different perceptual artefacts. Researchers can run copy synthesis (also called analysis-synthesis), feeding ground-truth mel spectrograms into the vocoder to isolate its error, before evaluating the acoustic model's contribution.

Stage | Input | Output | Key challenge
Acoustic model | Phoneme/text sequence | Mel spectrogram (80 x T) | Duration, prosody, attention alignment
Vocoder | Mel spectrogram (80 x T) | Raw waveform (about 240 x T samples at 24 kHz with a 10 ms hop) | Phase synthesis, periodicity, real-time speed

When it falls down

Prosody averaging. Acoustic models trained on large, diverse corpora learn to produce "average" prosody: flat pitch, moderate pace, limited expressiveness. The mel representation is a magnitude-only summary; subtle paralinguistic cues (irony, emphasis, emotional colouring) that live partly in dynamics and breathiness are smoothed out. This is not a failure of any specific model so much as a structural limitation of training with a regression objective against a single reference rendition: when many prosodic realisations of the same text are plausible, minimising the reconstruction error pulls the prediction toward their average.

Vocoder artefacts on out-of-distribution mel inputs. A vocoder trained on clean studio speech will produce audible buzzing or metallic tones when fed a mel spectrogram from a distant microphone, a noisy recording, or a synthesised mel that deviates from the training distribution. The vocoder has no mechanism for graceful degradation; it will try to invent a waveform consistent with whatever it receives.

Mismatch accumulation. The acoustic model and vocoder are usually trained independently (sometimes called the two-stage training problem). Errors in the mel prediction are not seen by the vocoder during its training, so the vocoder is not robust to exactly the kinds of errors the acoustic model makes. End-to-end fine-tuning can partially address this, but it complicates the modular substitution story.

Long-form coherence. Acoustic models with limited context windows (fixed receptive field, no global attention) lose track of discourse-level prosody over long utterances. A three-minute passage will sound like a sequence of independent sentences rather than a coherent monologue.

Real-time constraint on low-power hardware. GAN vocoders like HiFi-GAN are fast on server GPUs but non-trivial to run on mobile or embedded hardware. Smaller distilled variants exist, but they trade quality for efficiency.
