Tacotron 2

Tacotron 2 is a two-stage neural TTS pipeline that converts text to mel spectrograms with a sequence-to-sequence model, then synthesises raw audio with a WaveNet vocoder conditioned on those spectrograms, achieving a near-human Mean Opinion Score (MOS).

intermediate · 8 min read

In the Tacotron 2 evaluation, professionally recorded human speech scored a Mean Opinion Score (MOS) of roughly 4.58 on a 5-point naturalness scale. When Google published Tacotron 2 in late 2017, their system scored 4.53 on the same scale. The gap was barely perceptible to human listeners, yet the system required no hand-crafted acoustic features, no phoneme dictionaries, and no signal-processing pipelines beyond a standard mel filterbank. It took raw characters as input and produced raw waveform samples as output.

That result raised an immediate question: what architectural choices made such parity possible, and where does the system still break?

The Two-Stage Architecture

Tacotron 2 cleanly separates two concerns that earlier TTS pipelines conflated.

Stage 1: Text to mel spectrogram. A sequence-to-sequence network (encoder + attention + decoder) reads a character sequence and autoregressively predicts an 80-bin mel spectrogram one frame at a time. Each frame covers a 50 ms analysis window, and successive frames are 12.5 ms apart. The encoder stacks three convolutional layers followed by a bidirectional LSTM; the decoder is a two-layer LSTM with a location-sensitive attention mechanism.

Stage 2: Mel spectrogram to waveform. A modified WaveNet vocoder conditions on the mel frames output by Stage 1 and generates 24 kHz 16-bit audio sample by sample. The mel spectrogram is lossy (it discards phase), but it is compact and already encodes pronunciation and prosody, so the vocoder only has to learn to fill in waveform detail. The paper reports that this allows a considerably simpler WaveNet than earlier versions conditioned on linguistic features.

This separation matters: the two stages can be trained independently, failures can be diagnosed in isolation, and either stage can be swapped. Subsequent work (e.g., replacing WaveNet with WaveGlow or HiFi-GAN) left Stage 1 essentially unchanged.

Characters  →  [Encoder: 3× Conv + BiLSTM]
                         ↓
              [Location-Sensitive Attention]
                         ↓
              [Decoder: 2× LSTM + PreNet]
                         ↓
             Mel Spectrogram (80 bins, T frames)
                         ↓
              [WaveNet Vocoder (conditioned)]
                         ↓
             Waveform (24 kHz)

The Mel Spectrogram as Intermediate Representation

Choosing the mel spectrogram as the bridge between the two stages was one of the paper's central design decisions, and it is worth understanding why.

A raw waveform at 24 kHz has 24,000 samples per second. Modelling that sequence directly from text is an extreme sequence-length mismatch: at typical speaking rates, each character corresponds to upwards of a thousand waveform samples, and the model must stay coherent across all of them. With a 12.5 ms frame hop, mel spectrograms reduce this to 80 frames per second, a 300x reduction in the number of time steps, while preserving the perceptually relevant frequency content (the mel scale approximates how the cochlea responds to frequency).
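
The frame-rate arithmetic can be checked directly. The sketch below assumes the framing reported in the Tacotron 2 paper (a 50 ms window with a 12.5 ms hop at 24 kHz); the constant names are illustrative.

```python
SAMPLE_RATE_HZ = 24_000   # waveform sample rate
WINDOW_MS = 50.0          # analysis window length (assumed from the paper)
HOP_MS = 12.5             # step between consecutive frames (assumed from the paper)
N_MELS = 80               # mel channels per frame

samples_per_hop = int(SAMPLE_RATE_HZ * HOP_MS / 1000)        # 300
samples_per_window = int(SAMPLE_RATE_HZ * WINDOW_MS / 1000)  # 1200
frames_per_second = 1000 / HOP_MS                            # 80.0

# Reduction in the number of time steps the sequence model must predict.
time_step_reduction = SAMPLE_RATE_HZ / frames_per_second     # 300.0

# Reduction in raw values per second (each frame carries N_MELS numbers).
value_reduction = SAMPLE_RATE_HZ / (frames_per_second * N_MELS)  # 3.75

print(samples_per_hop, samples_per_window, frames_per_second)
print(time_step_reduction, value_reduction)
```

The two ratios answer different questions: the sequence model benefits from the 300x drop in time steps, while the drop in raw values per second is much smaller because every frame carries 80 numbers.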

Because mel spectrograms are both compact and perceptually interpretable, they are easier to predict accurately with a sequence-to-sequence model, and a well-trained vocoder can invert them into high-fidelity audio. The original Tacotron used linear-scale spectrograms reconstructed with the Griffin-Lim algorithm; the switch to mel scale plus a learned vocoder accounts for much of the quality jump between versions.
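
To make the mel scale concrete, here is a minimal sketch of how band centres are spaced. It assumes the HTK-style formula (one of several conventions in use) and the 125 Hz to 7.6 kHz range reported in the paper; the function names are illustrative and not taken from any particular library.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style mel mapping (an assumed convention; others exist)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centres(n_mels, f_min_hz, f_max_hz):
    """Centre frequencies of n_mels bands spaced evenly on the mel axis."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]

centres = mel_band_centres(80, 125.0, 7600.0)

# Spacing in Hz between neighbouring bands at the low and high ends.
low_gap = centres[1] - centres[0]
high_gap = centres[-1] - centres[-2]
print(round(low_gap, 1), round(high_gap, 1))
```

Bands are packed tightly at low frequencies and spread out at high ones, which is how 80 channels can cover the range that matters for speech.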

Attention and the Stop Token

The attention mechanism in Tacotron 2 is location-sensitive: in addition to the encoder hidden states and the current decoder state, it also conditions on the cumulative attention weights from previous decoder steps. This nudges the model to advance through the input rather than re-attending to the same region.
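
One attention step can be sketched as follows. This is a simplified illustration rather than the paper's exact configuration: it uses a single location filter where the paper uses several, toy untrained weights, and hypothetical names throughout.

```python
import math

def conv1d_same(signal, kernel):
    """Zero-padded 1-D correlation; output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def location_sensitive_step(query_proj, memory_proj, cumulative, kernel, loc_proj, v):
    """One attention step.

    query_proj : projected decoder state, length D
    memory_proj: projected encoder states, T lists of length D
    cumulative : sum of attention weights from all previous steps, length T
    kernel     : 1-D location filter (a single filter, for brevity)
    loc_proj   : maps the location feature into the D-dim attention space
    v          : scoring vector, length D
    """
    location = conv1d_same(cumulative, kernel)
    energies = []
    for t, mem in enumerate(memory_proj):
        hidden = [math.tanh(q + m + u * location[t])
                  for q, m, u in zip(query_proj, mem, loc_proj)]
        energies.append(sum(vd * hd for vd, hd in zip(v, hidden)))
    weights = softmax(energies)
    new_cumulative = [c + w for c, w in zip(cumulative, weights)]
    return weights, new_cumulative

# Toy, untrained parameters: 6 input positions, 4 attention dimensions.
T, D = 6, 4
memory_proj = [[0.1 * (t + d) for d in range(D)] for t in range(T)]
query_proj = [0.2] * D
kernel = [0.25, 0.5, 0.25]
loc_proj = [1.0] * D
v = [1.0] * D

cumulative = [0.0] * T
for _ in range(3):
    weights, cumulative = location_sensitive_step(
        query_proj, memory_proj, cumulative, kernel, loc_proj, v)
print([round(w, 3) for w in weights])
```

The only difference from plain additive attention is the `location` term: because it is derived from where attention has already been, a trained model can learn to favour positions just ahead of the previously attended region.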

A dedicated stop-token prediction head runs in parallel with the spectrogram prediction at every decoder step. When the head fires (sigmoid output crosses 0.5), decoding stops. This is simpler than the hand-tuned heuristics earlier systems used to decide utterance length, but it introduces a failure mode discussed below.
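
A minimal sketch of the decoding loop shows where the stop token fits. The `step_fn` callable is a hypothetical stand-in for one full decoder step, the all-zero first input is an assumption borrowed from common implementations, and `max_steps` is a safety cap that is not part of the stop-token mechanism itself.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(step_fn, stop_threshold=0.5, max_steps=1000, n_mels=80):
    """Run an autoregressive decoder until the stop head fires or the cap is hit.

    step_fn(prev_frame) -> (mel_frame, stop_logit) is a hypothetical stand-in
    for one pass through the PreNet, attention, decoder LSTMs, and output heads.
    """
    frames = []
    prev = [0.0] * n_mels                  # all-zero first input (assumed)
    for _ in range(max_steps):
        frame, stop_logit = step_fn(prev)
        frames.append(frame)
        if sigmoid(stop_logit) > stop_threshold:
            return frames, True            # stop head fired
        prev = frame
    return frames, False                   # cap reached; stop head never fired

def make_toy_step(fire_at, n_mels=80):
    """Toy decoder whose stop logit turns positive at step `fire_at`."""
    state = {"n": 0}
    def step(prev_frame):
        state["n"] += 1
        logit = 5.0 if state["n"] >= fire_at else -5.0
        return [0.0] * n_mels, logit
    return step

frames, stopped = decode(make_toy_step(fire_at=12))
print(len(frames), stopped)                # 12 True

runaway, runaway_stopped = decode(make_toy_step(fire_at=10**9), max_steps=50)
print(len(runaway), runaway_stopped)       # 50 False
```

Returning a flag alongside the frames lets the caller tell a clean stop from a truncated one, which matters when the stop head is unreliable.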

| Component | Input | Output |
| --- | --- | --- |
| Encoder Conv layers | Characters (embedded) | Local feature maps |
| Encoder BiLSTM | Feature maps | Contextual hidden states |
| Location-sensitive attention | Encoder states, decoder state, cumulative weights | Context vector |
| Decoder LSTM | Context + previous mel frame via PreNet | Hidden state |
| Mel projection | Decoder hidden state | 80-dim mel frame |
| Stop-token head | Decoder hidden state | Scalar (sigmoid) |

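The PreNet (two fully-connected layers of 256 ReLU units, with 0.5 dropout applied even at inference) sits between the previous mel frame and the decoder LSTM. The paper gives two reasons for this design. Keeping dropout active at inference introduces variation in the output, and the PreNet as a whole acts as an information bottleneck that the authors found essential for learning attention. A plausible reading is that the bottleneck stops the decoder from leaning too heavily on the autoregressive feed, which would otherwise let errors accumulate on long utterances.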
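
The always-on dropout is easy to miss in an implementation, so here is a minimal sketch in plain Python. The layer sizes follow the paper (256 units, dropout 0.5); the random weights, the inverted-dropout rescaling, and all function names are assumptions made for illustration.

```python
import random

def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def dropout(x, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in x]

def prenet(mel_frame, layers, rng, p=0.5):
    """Two FC + ReLU layers; dropout runs unconditionally (no train/eval switch)."""
    h = mel_frame
    for weights, bias in layers:
        h = dropout(relu(linear(h, weights, bias)), p, rng)
    return h

def random_layer(n_in, n_out, rng):
    """Untrained toy weights, for shape-checking only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
layers = [random_layer(80, 256, rng), random_layer(256, 256, rng)]
frame = [rng.uniform(-4.0, 0.0) for _ in range(80)]

out_a = prenet(frame, layers, rng)
out_b = prenet(frame, layers, rng)   # same input, fresh dropout mask
print(len(out_a), out_a == out_b)
```

Because there is no evaluation-mode switch, two calls on the same frame draw different dropout masks, which is the source of the output variation at inference.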

Training and Inference Details

In the original paper, both stages are trained on an internal US English dataset: 24.6 hours of speech from a single professional female speaker. Open-source reimplementations usually substitute LJ Speech, roughly 24 hours of a single English speaker reading public-domain texts, sampled at 22.05 kHz. The sequence-to-sequence model is trained with teacher forcing: the decoder receives ground-truth mel frames rather than its own predictions during training. At inference, it switches to fully autoregressive mode.
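
The difference between the two modes comes down to what the decoder is fed at each step. The toy below uses a hypothetical one-dimensional "decoder" with a constant bias to show why the switch matters: under teacher forcing the error stays flat, while in free-running mode it compounds.

```python
def teacher_forced_inputs(target_frames):
    """Decoder inputs for training: a zero first frame, then targets shifted by one."""
    n_mels = len(target_frames[0])
    return [[0.0] * n_mels] + target_frames[:-1]

def toy_step(prev_frame):
    """Hypothetical decoder step: the true increment is 1.0, the model predicts 1.1."""
    return [v + 1.1 for v in prev_frame]

targets = [[1.0], [2.0], [3.0], [4.0]]

# Training: every step is fed the ground-truth previous frame.
forced = [toy_step(frame) for frame in teacher_forced_inputs(targets)]

# Inference: every step is fed the model's own previous output.
free, prev = [], [0.0]
for _ in targets:
    prev = toy_step(prev)
    free.append(prev)

forced_err = [abs(p[0] - t[0]) for p, t in zip(forced, targets)]
free_err = [abs(p[0] - t[0]) for p, t in zip(free, targets)]
print([round(e, 2) for e in forced_err])  # [0.1, 0.1, 0.1, 0.1]
print([round(e, 2) for e in free_err])    # [0.1, 0.2, 0.3, 0.4]
```

A real decoder's errors are not a constant bias, but the mechanism is the same: at inference each mistake becomes part of the next input.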

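The loss is a sum of two mean-squared-error terms, one on the decoder's mel prediction before the post-net and one on the prediction after it, plus a binary cross-entropy loss for the stop token. The post-net is a five-layer convolutional network whose output is added to the initial prediction as a residual to sharpen the spectrogram.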
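
As a sketch, the combined objective looks like this in plain Python. The equal weighting of the three terms and the use of raw logits for the stop token are assumptions that match common implementations; the function names are illustrative.

```python
import math

def mse(pred, target):
    """Mean squared error over all frames and mel channels."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count

def bce_with_logits(logits, labels):
    """Numerically stable binary cross-entropy on raw stop-token logits."""
    losses = [max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
              for z, y in zip(logits, labels)]
    return sum(losses) / len(losses)

def tacotron2_loss(mel_before, mel_after, mel_target, stop_logits, stop_labels):
    """MSE before the post-net + MSE after the post-net + stop-token BCE."""
    return (mse(mel_before, mel_target)
            + mse(mel_after, mel_target)
            + bce_with_logits(stop_logits, stop_labels))

# One frame, two mel channels: the pre-post-net prediction is off by 1.0,
# the post-net prediction is exact, and the stop head is undecided (logit 0).
loss = tacotron2_loss(
    mel_before=[[1.0, 1.0]],
    mel_after=[[0.0, 0.0]],
    mel_target=[[0.0, 0.0]],
    stop_logits=[0.0],
    stop_labels=[1.0],
)
print(round(loss, 4))  # 1.6931
```

Penalising the prediction both before and after the post-net keeps the decoder's own output useful rather than letting the post-net do all the work.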

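The WaveNet vocoder is trained separately. In the original paper it is conditioned on ground-truth-aligned predictions: Stage 1 is run in teacher-forcing mode so that each predicted frame lines up with the target waveform, which exposes the vocoder to Stage 1's slightly imperfect outputs. Many later pipelines instead train the vocoder on ground-truth mel spectrograms, so it never sees Stage 1 outputs during training. That mismatch between training and inference conditions is a known source of artefacts, particularly at the start and end of utterances.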

When It Falls Down

Attention failures on long or unusual inputs. Location-sensitive attention reliably advances through short, well-punctuated sentences. On inputs longer than roughly 200 characters, or on inputs with repeated substrings, the attention can skip a region entirely (skipping words) or loop back (repeating words). This is not a corner case; it occurs frequently enough that production deployments wrap Tacotron 2 with length limits and fallback logic.
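
Skips and repeats are usually visible in the attention alignment, so a simple check can feed the fallback logic. The sketch below is a heuristic, not a method from the paper: it follows the most-attended input position at each decoder step and flags backward moves and long forward jumps. The `max_jump` tolerance is a hypothetical parameter that would need tuning.

```python
def attention_path(alignment):
    """Index of the most-attended input position at each decoder step."""
    return [max(range(len(row)), key=row.__getitem__) for row in alignment]

def diagnose_alignment(alignment, max_jump=3):
    """Flag backward moves (possible repeats) and long forward jumps (possible skips).

    alignment: one row of attention weights per decoder step.
    """
    path = attention_path(alignment)
    repeats, skips = [], []
    for step in range(1, len(path)):
        delta = path[step] - path[step - 1]
        if delta < 0:
            repeats.append(step)
        elif delta > max_jump:
            skips.append(step)
    n_inputs = len(alignment[0])
    reached_end = path[-1] >= n_inputs - 1
    return {"repeats": repeats, "skips": skips, "reached_end": reached_end}

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

clean = [one_hot(i, 5) for i in [0, 0, 1, 2, 3, 4]]
broken = [one_hot(i, 10) for i in [0, 1, 2, 1, 2, 9]]

print(diagnose_alignment(clean))
# {'repeats': [], 'skips': [], 'reached_end': True}
print(diagnose_alignment(broken))
# {'repeats': [3], 'skips': [5], 'reached_end': True}
```

A wrapper can use such a report to retry, split the input into shorter chunks, or fall back to another system.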

Stop-token unreliability. The sigmoid stop head can fire early on sentences ending with unstressed syllables, cutting off the final phoneme. It can also fail to fire, generating silence frames indefinitely. Neither failure mode is trivially recoverable without a secondary heuristic.

Slow inference due to autoregression. With a 12.5 ms frame hop, generating one second of speech requires 80 sequential decoder steps, after which the WaveNet vocoder generates 24,000 samples one at a time. Stage 2 dominates the cost: a straightforward autoregressive WaveNet implementation typically runs far slower than real-time, and approaching real-time took specialised GPU kernels. This bottleneck drove the development of parallel vocoders (WaveGlow, HiFi-GAN) that replaced Stage 2 with a flow-based or GAN-based model capable of real-time synthesis.

Single-speaker training by default. The published model is trained on one speaker. Multi-speaker extension requires conditioning the model on speaker embeddings (e.g., d-vectors), which adds complexity and data requirements, and still struggles with highly accented or low-resource voices.
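
One common way to inject speaker identity is to append the same speaker vector to every encoder time step before attention. The sketch below shows only that concatenation; the dimensions are toy values and the function name is hypothetical.

```python
def condition_on_speaker(encoder_states, speaker_embedding):
    """Append the same speaker vector to every encoder time step."""
    return [state + speaker_embedding for state in encoder_states]

# Three encoder time steps of width 4, and a 2-dimensional speaker vector.
encoder_states = [[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.9, 1.0, 1.1, 1.2]]
speaker_embedding = [-1.0, 1.0]

conditioned = condition_on_speaker(encoder_states, speaker_embedding)
print(len(conditioned), len(conditioned[0]))  # 3 6
```

The attention and decoder layers then need input widths that account for the extra dimensions, which is part of the added complexity.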

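Training/inference mismatch in the vocoder. A vocoder trained only on ground-truth mel spectrograms is optimised for a distribution slightly different from Stage 1's outputs, which tend to be smoother than real spectrograms. Spectral smearing and slight pitch instability at prosodic boundaries can follow. The original paper sidesteps this by training WaveNet on Stage 1's teacher-forced predictions, but pipelines that pair Stage 1 with an independently trained vocoder inherit the problem unless the vocoder is fine-tuned on predicted spectrograms.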

No control over prosody. The base model provides no interface for manipulating pitch, rate, or emphasis. Global Style Token (GST) conditioning was a popular extension but adds significant training complexity.

Despite these limits, Tacotron 2 remains a standard baseline. Its architecture is simple enough to understand fully, its failure modes are well-characterised, and it defined the mel-spectrogram-as-intermediate-representation idiom that nearly every subsequent neural TTS system has retained.
