FastSpeech and Non-Autoregressive TTS

At inference time, Tacotron 2 generates mel-spectrogram frames one by one: each step attends over the encoder output and then conditions on the frame it just produced. Generation time therefore grows linearly with utterance length and cannot be parallelised across frames, and any attention misalignment can cause a word to be skipped or repeated with no way to recover mid-sentence. FastSpeech (Ren et al., NeurIPS 2019) asked a pointed question: what if we simply removed the recurrence?

The core problem with autoregressive attention

Sequence-to-sequence TTS models like Tacotron 2 use soft attention to align encoder hidden states (one per phoneme or character) to decoder steps (one per mel frame). This works well on average, but the alignment must be learned implicitly. Nothing in the loss prevents the decoder from attending to the wrong position, and at inference time there is no ground-truth teacher signal to keep it on track.

Two failure modes appear in practice:
- Word skipping: the attention jumps forward too fast, so one or more phonemes are never voiced.
- Word repetition: the attention stays on one position too long, so a syllable is repeated.
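
Both failure modes can be spotted in a hard alignment trace. The sketch below assumes the trace is simply the most-attended input position at each decoder step; the function name `alignment_faults` and the `max_stay` threshold are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Sequence


def alignment_faults(attended: Sequence[int], max_stay: int) -> Dict[str, List[int]]:
    """Flag skipped and stalled input positions in an argmax attention trace.

    attended[t] is the input position with the highest attention weight at
    decoder step t. Backward jumps are ignored in this simplified check.
    """
    skipped: List[int] = []
    stalled: List[int] = []
    run = 1  # number of consecutive steps spent on the current position
    for prev, cur in zip(attended, attended[1:]):
        if cur == prev:
            run += 1
            if run == max_stay + 1:  # report each stall once
                stalled.append(cur)
        else:
            run = 1
            if cur > prev + 1:  # positions in between were never attended
                skipped.extend(range(prev + 1, cur))
    return {"skipped": skipped, "stalled": stalled}


# Position 2 is jumped over; position 3 is held for five steps.
trace = [0, 0, 1, 3, 3, 3, 3, 3, 4]
faults = alignment_faults(trace, max_stay=4)
print(faults)
```

A check like this only detects the problem after the fact; it does not prevent it.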

Both are worse for long sentences, unusual names, and domain-shifted text. Making the alignment explicit is the obvious fix; the question is how.

The FastSpeech architecture

FastSpeech replaces the recurrent decoder with a feed-forward Transformer (FFT) stack and inserts a length regulator between the encoder and decoder.

Text tokens
    |
  [Encoder FFT blocks]  <-- self-attention + 1D convolution, no recurrence
    |
  [Duration Predictor]  <-- shallow 2-layer CNN predicting log-duration per phoneme
    |
  [Length Regulator]    <-- repeats each encoder hidden state d_i times
    |
  Expanded sequence (one frame per mel column)
    |
  [Decoder FFT blocks]  <-- same architecture as encoder
    |
  Linear projection -> mel-spectrogram (all frames simultaneously)

The length regulator is the key piece. If a phoneme's predicted duration is d_i frames, its encoder hidden vector is replicated d_i times to produce the input to the decoder. The decoder then refines all frames in parallel, with no step-to-step dependency.
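
A minimal sketch of the length regulator in plain Python, assuming hidden states are lists of floats and durations are already integers. The name `length_regulate` and the toy values are illustrative and not taken from the paper's code.

```python
from typing import List, Sequence


def length_regulate(hidden: Sequence[Sequence[float]],
                    durations: Sequence[int]) -> List[List[float]]:
    """Repeat hidden[i] durations[i] times, preserving phoneme order."""
    if len(hidden) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    expanded: List[List[float]] = []
    for vector, d in zip(hidden, durations):
        if d < 0:
            raise ValueError("durations must be non-negative")
        # Copy each vector so that frames do not alias one another.
        expanded.extend(list(vector) for _ in range(d))
    return expanded


# Three phonemes with 2-dimensional hidden vectors (toy values).
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]  # the middle phoneme receives no frames at all
frames = length_regulate(hidden, durations)
print(len(frames))  # equals sum(durations)
```

The output length is fixed before the decoder runs, which is what lets the decoder process every frame at once.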

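During training, durations come from a pre-trained autoregressive teacher model (Transformer TTS in the original paper). The most diagonal attention head is selected, each mel frame is assigned to the phoneme it attends to most strongly, and the number of frames assigned to a phoneme becomes that phoneme's target duration. This teacher-student dependency was the main criticism of FastSpeech 1.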
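
The conversion from soft attention weights to hard durations can be sketched as frame counting: assign every frame to its most-attended phoneme and count frames per phoneme. The function name and the toy attention matrix below are assumptions for illustration, and the step that selects which attention head to use is omitted.

```python
from typing import List, Sequence


def durations_from_attention(attn: Sequence[Sequence[float]],
                             num_phonemes: int) -> List[int]:
    """attn[t][i] is the weight that mel frame t places on phoneme i."""
    durations = [0] * num_phonemes
    for row in attn:
        # Hard-assign this frame to the phoneme with the largest weight.
        best = max(range(num_phonemes), key=lambda i: row[i])
        durations[best] += 1
    return durations


# Five mel frames attending over three phonemes (toy values).
attn = [
    [0.9, 0.1, 0.0],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
]
durations = durations_from_attention(attn, num_phonemes=3)
print(durations)  # [2, 1, 2]
```

By construction the durations sum to the number of mel frames, so the expanded sequence matches the target spectrogram length.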

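The duration predictor is trained with mean-squared error on log-durations. At inference time its outputs are mapped back from the log domain and rounded to integer frame counts, giving fully deterministic synthesis in which no phoneme can be skipped or repeated by a wandering attention.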
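
A small sketch of the training loss and the inference-time decoding, using only the standard library. The `log(d + 1)` offset is an assumption borrowed from common open-source implementations (it keeps zero-length targets finite); the source text only says the predictor works on log-durations.

```python
import math
from typing import List, Sequence


def log_duration_mse(pred_log: Sequence[float],
                     target_frames: Sequence[int]) -> float:
    """Mean-squared error between predictions and log(target + 1)."""
    errors = [(p - math.log(t + 1)) ** 2
              for p, t in zip(pred_log, target_frames)]
    return sum(errors) / len(errors)


def decode_durations(pred_log: Sequence[float]) -> List[int]:
    """Invert the log(d + 1) transform, round, and clamp at zero.

    Python's round() rounds exact halves to the nearest even integer.
    """
    return [max(0, round(math.exp(p) - 1)) for p in pred_log]


# Toy predictions corresponding to 2.0, 0.4 and 5.2 frames.
pred_log = [math.log(3.0), math.log(1.4), math.log(6.2)]
loss = log_duration_mse(pred_log, target_frames=[2, 0, 5])
decoded = decode_durations(pred_log)
print(decoded)  # [2, 0, 5]
```

Note that a prediction below half a frame rounds to zero, which removes the phoneme from the output entirely.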

FastSpeech 2: removing the teacher

FastSpeech 2 (Ren et al., ICLR 2021) removed the teacher-student pipeline and added two new predictors alongside duration: pitch and energy. All three targets are derived from the training data itself rather than from a teacher: durations from a forced aligner, F0 from the waveform using tools like pyworld, and energy as the frame-level L2 norm of the STFT magnitude. Pitch and energy are then quantised into discrete bins, and each bin maps to a learnable embedding that is added to the hidden sequence; durations drive the length regulator instead.

Predictor	Extraction tool	Conditioning
Duration	MFA / forced aligner	Length regulator
Pitch (F0)	WORLD vocoder (pyworld)	Embedding lookup
Energy	Frame-level L2 norm	Embedding lookup

This matters for two reasons. First, it eliminates the need to train a separate autoregressive teacher, simplifying the pipeline. Second, pitch and energy embeddings give operators a knob: dial pitch up 20% and the model produces a higher-voiced reading. This is controllable, not just fast.
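
The pitch knob can be illustrated with the quantisation step alone. The sketch below assumes 256 log-spaced bins between 80 Hz and 400 Hz; both the bin count and the range are illustrative choices, and the helper names are hypothetical. In a real model, each bin index would select a row of a learned embedding table.

```python
import bisect
import math
from typing import List


def log_spaced_edges(f_min: float, f_max: float, n_bins: int) -> List[float]:
    """Return n_bins - 1 interior boundaries, evenly spaced in log-frequency."""
    step = (math.log(f_max) - math.log(f_min)) / n_bins
    return [math.exp(math.log(f_min) + step * k) for k in range(1, n_bins)]


def quantise(f0_hz: float, edges: List[float]) -> int:
    """Index of the bin containing f0_hz, in the range 0 .. len(edges)."""
    return bisect.bisect_right(edges, f0_hz)


edges = log_spaced_edges(f_min=80.0, f_max=400.0, n_bins=256)
f0_track = [110.0, 125.0, 140.0]  # toy F0 contour in Hz

original = [quantise(f, edges) for f in f0_track]
raised = [quantise(f * 1.2, edges) for f in f0_track]  # pitch up 20%
shift = [b - a for a, b in zip(original, raised)]
print(original, raised, shift)
```

Because the bins are log-spaced, a fixed percentage change moves every frame by roughly the same number of bins, regardless of the starting pitch.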

FastSpeech 2s (the "s" suffix) extends the decoder to predict raw waveform samples directly, bypassing the vocoder entirely. In practice, separate vocoders like HiFi-GAN still tend to win on perceptual quality, so FastSpeech 2 + HiFi-GAN remains a common pairing in production systems.

Inference speedup in practice

On a single V100, the FastSpeech 1 paper reported a speedup of roughly 270x for mel-spectrogram generation and roughly 38x for end-to-end synthesis, both measured against its autoregressive Transformer TTS baseline. With a parallel vocoder the full pipeline can exceed 20x real-time on a modest GPU, which brings on-device deployment within reach.
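
Speed claims for TTS come in several conventions (real-time factor, multiples of real time, speedup over a baseline), and they are easy to mix up. The helper names and timings below are hypothetical and only show how the three quantities relate.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute time per second of audio; below 1.0 is faster than real time."""
    return synthesis_seconds / audio_seconds


def times_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of audio produced per second of compute (inverse of RTF)."""
    return audio_seconds / synthesis_seconds


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Hypothetical timings for a 10-second utterance.
audio = 10.0
autoregressive = 8.0  # seconds of compute, made up for illustration
parallel = 0.25       # seconds of compute, made up for illustration

print(real_time_factor(parallel, audio))  # 0.025
print(times_real_time(parallel, audio))   # 40.0
print(speedup(autoregressive, parallel))  # 32.0
```

A speedup is always relative to a named baseline, whereas the real-time factor is an absolute property of one system on one machine.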

The broader class of non-autoregressive TTS models

FastSpeech established a template that later models follow: (1) an explicit duration or alignment model that decouples input length from output length, (2) a parallel generative network conditioned on the aligned representation.

Glow-TTS (Kim et al., NeurIPS 2020) replaces the feed-forward decoder with a normalising flow and learns its own monotonic alignment during training via dynamic programming in latent space. It avoids teacher-student distillation and produces sharper mel-spectrograms at the cost of more complex training.
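
The dynamic programming step can be sketched as a Viterbi-style search over a token-by-frame score matrix. This is a simplified pure-Python illustration, not the Glow-TTS implementation: the scores are toy numbers standing in for the log-likelihood of each latent frame under each token's prior, and the function name is an assumption.

```python
from typing import List, Sequence

NEG_INF = float("-inf")


def monotonic_alignment_search(log_lik: Sequence[Sequence[float]]) -> List[int]:
    """Return the token index assigned to each frame.

    log_lik[i][j] scores assigning frame j to token i. The best path starts
    at token 0, ends at the last token, never moves backwards, and never
    skips a token.
    """
    n_tokens, n_frames = len(log_lik), len(log_lik[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")
    # q[i][j] is the best total score of a valid path ending at (i, j).
    q = [[NEG_INF] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else NEG_INF
            q[i][j] = max(stay, advance) + log_lik[i][j]
    # Backtrack from the last token at the last frame.
    path = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if i > 0 and (i == j or q[i - 1][j - 1] >= q[i][j - 1]):
            i -= 1
    return path


log_lik = [
    [-0.1, -0.2, -3.0, -4.0, -5.0],
    [-3.0, -2.5, -0.1, -0.3, -4.0],
    [-5.0, -4.0, -3.0, -2.0, -0.1],
]
path = monotonic_alignment_search(log_lik)
print(path)  # [0, 0, 1, 1, 2]
```

Counting how often each token appears in the path yields per-token durations, which is how such a search can supply targets for a duration predictor without an external aligner.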

VITS (Kim et al., ICML 2021) goes end-to-end: a posterior encoder maps the recorded speech (as a linear spectrogram) to latent variables, a normalising flow makes the text-conditioned prior more expressive, monotonic alignment search links phonemes to latent frames, and a GAN decoder (HiFi-GAN-style) synthesises waveforms directly. Duration is still modelled explicitly, this time by a stochastic duration predictor. VITS achieves human-competitive MOS on LJSpeech without a separate vocoder.

The common thread is that explicit alignment - whether from a forced aligner, a teacher's attention, or a monotonic alignment search in latent space - is what makes parallel generation tractable.

When it falls down

Duration errors are clearly audible. In autoregressive models, lingering on a character just means one extra frame, and the model can self-correct on the next step. In FastSpeech, an incorrect duration prediction stretches or compresses an entire phoneme's acoustic features, and nothing downstream can undo it. Rounding errors on short phonemes (plosives under 30ms) are particularly audible, because a single frame is a large fraction of the phoneme's total length.
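
The arithmetic behind the short-phoneme problem can be worked through directly. The sketch assumes a hop length of 256 samples at 22050 Hz, a common configuration for LJSpeech-style setups but not one stated in the text above; the phoneme lengths are made up for illustration.

```python
from typing import Tuple


def frame_ms(hop_length: int, sample_rate: int) -> float:
    """Duration of one mel frame in milliseconds."""
    return 1000.0 * hop_length / sample_rate


def rounding_error(true_ms: float, hop_ms: float) -> Tuple[int, float, float]:
    """Return (frames, realised duration in ms, relative error)."""
    frames = round(true_ms / hop_ms)
    realised = frames * hop_ms
    return frames, realised, (realised - true_ms) / true_ms


hop_ms = frame_ms(hop_length=256, sample_rate=22050)  # about 11.6 ms
plosive = rounding_error(30.0, hop_ms)   # short phoneme
vowel = rounding_error(120.0, hop_ms)    # long phoneme

print(f"frame = {hop_ms:.2f} ms")
print(f"plosive: {plosive[0]} frames, error {plosive[2]:+.1%}")
print(f"vowel:   {vowel[0]} frames, error {vowel[2]:+.1%}")
```

Rounding moves a duration by at most half a frame, so the same absolute error is several times larger, in relative terms, on the 30 ms plosive than on the 120 ms vowel.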

Prosody is flatter on average. Autoregressive models, by conditioning each frame on the previous one, implicitly model fine-grained prosodic dependencies within a word. FastSpeech's frame-level parallelism breaks this within-word conditioning. The result is often described as more "neutral" or "robotic" on expressive read-out-loud tasks, even when pitch and energy predictors are present.

Pitch predictor generalisation. The pitch predictor sees training-domain F0 distributions. On out-of-domain text (questions, exclamations, domain-specific jargon) the predictor may revert to a flat mean. Prosody transfer or reference encoder conditioning is needed to handle stylistic extremes.

Multi-speaker and voice cloning are harder. The speaker embedding modulates the entire network, but duration statistics are also speaker-dependent. A single duration predictor shared across speakers tends to perform worse than speaker-conditioned predictors, requiring either separate heads or a more sophisticated conditioning scheme.

Forced aligner dependency. FastSpeech 2 requires a forced-alignment pass over the training corpus, typically with the Montreal Forced Aligner (MFA). MFA works well for standard English but struggles with languages that have limited pronunciation dictionaries, with low-resource settings, and with singing voice data. Glow-TTS and VITS sidestep this at the cost of training complexity.