Attention Failures in TTS

Attention-based TTS systems fail in predictable ways - word skipping, repetition, and unstable alignment - and understanding the mechanics behind each failure mode is essential for building reliable speech synthesis pipelines.

A Tacotron model trained on 24 hours of single-speaker audio can, without warning, skip the word "particularly" in a sentence and repeat "the" three times. No loss spike, no gradient explosion - the model converges cleanly and then misbehaves at inference on certain inputs. This is the core tension of attention-based TTS: the alignment mechanism that makes end-to-end training elegant is also a fragile implicit contract between encoder and decoder that breaks under pressure.

What attention is doing in TTS

In sequence-to-sequence TTS (Tacotron, Tacotron 2, and their descendants), an encoder maps a character or phoneme sequence into context vectors. The decoder generates mel-spectrogram frames one at a time, and at each step an attention mechanism computes a probability distribution over encoder outputs - this is the "alignment".

A correct alignment looks like a near-diagonal band when visualised as a matrix of decoder steps vs. encoder positions. At each frame the decoder should attend to roughly one phoneme, dwelling on it for a few frames and then advancing left-to-right in step with the text. The model learns this monotonic behaviour entirely from data: nothing in vanilla soft attention forces monotonicity. It emerges as a statistical pattern, which means it can also fail to emerge - or emerge and then dissolve.

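The location-sensitive attention used in Tacotron 2 feeds the cumulative attention weights from earlier decoder steps into the attention score as an extra feature, nudging the mechanism toward monotonic left-to-right progress:

e_{i,t} = score(s_t, a_{t-1}, h_i)
        = v^T tanh(W s_t + V h_i + U (f * a_{t-1})_i)

Here e_{i,t} is the unnormalised score for encoder position i at decoder step t, s_t is the decoder state, h_i is the i-th encoder output, a_{t-1} is the attention weights accumulated over the steps up to t-1, f is a learned convolutional filter (so f * a_{t-1} is a convolution, not a product), and v, W, V, U are learned parameters. A softmax over i turns the scores into the alignment. The convolved attention history biases the model away from positions it has already attended to, which improves robustness but does not eliminate failures.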
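The scoring step above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not Tacotron 2's actual implementation: the function name, the shapes, the absence of bias terms and the use of k filters are choices made for this example, and the parameters are random rather than trained.

```python
import numpy as np

def location_sensitive_step(s_t, h, a_prev, W, V, U, f, v):
    """One attention step; returns weights over the N encoder positions.

    s_t:    (d_s,)      decoder state
    h:      (N, d_h)    encoder outputs
    a_prev: (N,)        attention weights carried over from earlier steps
    W: (d_a, d_s), V: (d_a, d_h), U: (d_a, k), v: (d_a,)
    f:      (k, width)  k convolution filters, with width <= N
    """
    # Location features: each filter convolved over a_prev, one row per position.
    loc = np.stack([np.convolve(a_prev, f_j, mode="same") for f_j in f], axis=1)  # (N, k)
    # Additive score v^T tanh(W s_t + V h_i + U loc_i), for all i at once.
    energies = np.tanh(W @ s_t + h @ V.T + loc @ U.T) @ v  # (N,)
    # Numerically stable softmax over encoder positions.
    energies = energies - energies.max()
    weights = np.exp(energies)
    return weights / weights.sum()

# Toy usage with random parameters (all sizes are arbitrary).
rng = np.random.default_rng(0)
N, d_s, d_h, d_a, k, width = 6, 4, 5, 8, 2, 3
a_prev = np.zeros(N)
a_prev[0] = 1.0  # pretend all earlier attention sat on the first position
weights = location_sensitive_step(
    s_t=rng.normal(size=d_s), h=rng.normal(size=(N, d_h)), a_prev=a_prev,
    W=rng.normal(size=(d_a, d_s)), V=rng.normal(size=(d_a, d_h)),
    U=rng.normal(size=(d_a, k)), f=rng.normal(size=(k, width)),
    v=rng.normal(size=d_a),
)
print(weights.round(3), weights.sum())
```

With untrained parameters the output is just a valid probability distribution; the monotonic bias only appears once f and U have been learned.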

The three canonical failure modes

1. Skipping (under-attention)

The attention weight distribution jumps forward too fast, skipping one or more phonemes. The resulting audio omits syllables or entire words. Skipping is more common on:

  • Long or rare words where the encoder representation is less confident.
  • Words at sentence boundaries where there is no contextual pull from the following phoneme.
  • Polysyllabic words with reduced stress (adverbs like "particularly", "necessarily").

The diagnostic signature is a gap in the alignment diagonal: the decoder steps advance without any frame spending significant weight on certain encoder positions.
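A gap in the diagonal can be checked for mechanically. The sketch below assumes the alignment is available as a (decoder frames x encoder positions) NumPy array; the function name and the `min_peak` threshold of 0.3 are illustrative choices, not established values.

```python
import numpy as np

def skipped_positions(alignment, min_peak=0.3):
    """Return encoder indices that never receive at least `min_peak`
    attention weight in any decoder frame.

    alignment: (T, N) array, rows = decoder frames, cols = encoder positions.
    """
    peak_per_position = alignment.max(axis=0)
    return np.flatnonzero(peak_per_position < min_peak).tolist()

# Toy alignment: five frames attend to positions 0, 0, 1, 3, 3.
frames_to_position = [0, 0, 1, 3, 3]
alignment = np.eye(4)[frames_to_position]
print(skipped_positions(alignment))  # [2]
```

In practice padding and end-of-sequence symbols would need to be excluded before counting a position as skipped.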

2. Repetition (over-attention)

The attention distribution stalls and loops, re-attending to the same encoder position for many consecutive frames. The decoder outputs the same phoneme repeatedly, often producing a stutter or an indefinitely long segment. This is the most audible failure: listeners notice repetition immediately.

Repetition tends to cluster on:

  • Words near the end of long sentences (the decoder has accumulated error over many steps).
  • Words that are acoustically similar to their neighbours - the attention has multiple locally plausible optima.
  • Inputs that exceed the maximum length seen during training (length extrapolation).
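Stalls and loops leave a simple trace in the per-frame attention peak. The following sketch, again assuming a (decoder frames x encoder positions) array and using hypothetical function names, reports the longest run of frames stuck on one position and the number of backward jumps. What counts as "too long" depends on the phoneme durations in the data, so no threshold is suggested here.

```python
import numpy as np

def longest_stall(alignment):
    """Return (encoder_position, run_length) for the longest run of
    consecutive frames whose attention peak stays on one position."""
    peaks = alignment.argmax(axis=1)
    best_pos, best_len, run_len = int(peaks[0]), 1, 1
    for prev, cur in zip(peaks[:-1], peaks[1:]):
        run_len = run_len + 1 if cur == prev else 1
        if run_len > best_len:
            best_pos, best_len = int(cur), run_len
    return best_pos, best_len

def backward_jumps(alignment):
    """Count frames whose attention peak moves to an earlier position."""
    peaks = alignment.argmax(axis=1)
    return int((np.diff(peaks) < 0).sum())

# Toy alignment: the peak lingers on position 2 for five frames.
alignment = np.eye(4)[[0, 1, 1, 2, 2, 2, 2, 2, 3]]
print(longest_stall(alignment))   # (2, 5)
print(backward_jumps(alignment))  # 0
```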

3. Attention collapse / diagonal drift

A subtler failure where the alignment is approximately correct but the diagonal blurs and drifts - the decoder attends to a weighted average of a wide spread of encoder positions rather than a sharp peak. The resulting audio is intelligible but degraded: vowels are slightly wrong, consonants are smeared, and the speaking rate fluctuates. This failure can go undetected in MOS evaluations, because raters tend to penalise outright intelligibility errors more heavily than subtle smearing.
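How sharp or blurry an alignment is can be summarised with two numbers per utterance. The sketch below is one possible measure, not a standard metric: it assumes a (decoder frames x encoder positions) array whose rows sum to 1, and the function name is made up for this example.

```python
import numpy as np

def alignment_focus(alignment, eps=1e-12):
    """Return (mean peak weight, mean entropy in nats) over decoder frames.

    A sharp alignment has a mean peak near 1 and entropy near 0; a blurry
    one has a low peak and entropy approaching log(N).
    """
    mean_peak = float(alignment.max(axis=1).mean())
    entropy = -(alignment * np.log(alignment + eps)).sum(axis=1)
    return mean_peak, float(entropy.mean())

sharp = np.eye(4)              # every frame attends to exactly one position
blurry = np.full((4, 4), 0.25)  # every frame attends uniformly to all four
print(alignment_focus(sharp))
print(alignment_focus(blurry))
```

Tracking these values over a held-out set during training gives an early warning before the degradation becomes audible.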

A table of failure modes and their diagnostic signatures:

Failure mode   | Alignment signature        | Audio artefact                 | Typical cause
---------------|----------------------------|--------------------------------|-------------------------------------
Skipping       | Gap in diagonal            | Missing phoneme or word        | Rare/long word, insufficient context
Repetition     | Stalled diagonal           | Stutter or looped segment      | End-of-sentence, length OOD
Diagonal drift | Wide, blurry attention peak | Smeared prosody, rate variation | Insufficient training data or length
Off-diagonal   | Random scatter             | Unintelligible output          | Training collapse

Why standard soft attention is the root cause

Soft attention, whether additive or dot-product, has no structural bias toward monotonicity. In machine translation this is acceptable: a decoder might legitimately need to attend to the beginning of the source again late in decoding. In TTS it is not: speech unfolds in time, and there is no scenario in which the decoder at frame 200 should jump back and read the first phoneme of the sentence again.

The implicit learning of monotonicity is sample-hungry. Tacotron needs many thousands of (text, audio) pairs before the attention diagonal stabilises during training. With small datasets (under a few hours of speech), training may complete without the diagonal ever fully forming - the model memorises rather than generalises, and inference on new text degrades immediately.

Furthermore, because the attention is autoregressive - each frame's alignment depends on the previous frame's alignment - errors compound. A single frame where the attention weight drifts slightly forward can cause a cascading misalignment over the following frames.

Mitigations (and their trade-offs)

Guided attention loss (introduced with the DC-TTS system of Tachibana et al. and since adopted in several open-source toolkits) adds a soft penalty during training that discourages non-diagonal attention weights. This accelerates alignment learning significantly and reduces failure rates on short to medium sentences. The penalty takes the form:

L_guided = mean( W_{n,t} * A_{n,t} )
W_{n,t}  = 1 - exp( -(n/N - t/T)^2 / (2g^2) )

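where A_{n,t} is the attention weight on encoder step n at decoder step t, N and T are the numbers of encoder and decoder steps, and g is a width hyperparameter that controls how far from the diagonal attention can stray before it is penalised. The cost is a modest regularisation effect that can slightly reduce expressiveness on unusual prosody.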
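The penalty is short enough to write out directly. The sketch below follows the two formulas above for a single utterance; the function name and the default g = 0.2 are assumptions for illustration, and a real training loop would compute this on batched, padded tensors in the framework's own array type.

```python
import numpy as np

def guided_attention_loss(A, g=0.2):
    """Guided attention penalty for one utterance.

    A: (N, T) attention matrix, A[n, t] = weight on encoder step n at
       decoder step t (indexed as in the formula, encoder first).
    g: width hyperparameter; the default here is illustrative.
    """
    N, T = A.shape
    n = np.arange(N)[:, None] / N
    t = np.arange(T)[None, :] / T
    W = 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))
    return float((W * A).mean())

diagonal = np.eye(5)             # perfectly diagonal alignment
reversed_ = np.eye(5)[::-1]      # attention running against the text
print(guided_attention_loss(diagonal))   # 0.0
print(guided_attention_loss(reversed_))  # > 0
```

A diagonal alignment costs nothing because W is zero wherever n/N equals t/T; weight placed far from the diagonal is penalised almost in full.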

Monotonic attention variants (such as Monotonic Chunkwise Attention, MoChA) constrain the attention to only move forward, which rules out looping back to earlier text, although the attention can still linger on one position. They introduce their own artefact: hard boundaries can produce audible discontinuities at chunk edges.
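MoChA itself is too involved for a short example, so the sketch below shows a much simpler relative of the same idea: an inference-time window that only lets the attention stay put or move a few positions forward from its previous peak. It is a heuristic stand-in, not MoChA, and the function name and the `back` and `ahead` defaults are hypothetical.

```python
import numpy as np

def windowed_attention(energies, prev_peak, back=0, ahead=3):
    """Softmax over `energies`, restricted to the encoder positions in
    [prev_peak - back, prev_peak + ahead]; all others get zero weight."""
    N = energies.shape[0]
    lo = max(0, prev_peak - back)
    hi = min(N, prev_peak + ahead + 1)
    masked = np.full(N, -np.inf)
    masked[lo:hi] = energies[lo:hi]
    masked = masked - masked[lo:hi].max()  # stable softmax inside the window
    weights = np.exp(masked)               # exp(-inf) is exactly 0
    return weights / weights.sum()

# The raw energies favour position 0 (a loop back to the start), but the
# previous peak was at position 4, so only positions 4..7 are allowed.
energies = np.array([5.0, 0.1, 0.2, 0.0, 1.0, 2.0, 0.5, 0.3, 0.1, 0.0])
weights = windowed_attention(energies, prev_peak=4)
print(weights.round(3), int(weights.argmax()))
```

The window removes backward jumps and long forward skips, but like any hard constraint it can also block a legitimate move if the previous peak was itself wrong.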

Duration-based architectures (FastSpeech, FastSpeech 2) sidestep attention failures by predicting a scalar duration per phoneme and using a length regulator to expand the phoneme sequence to match the mel frame count. There is no text-to-frame cross-attention to fail. The trade-off is that duration prediction is a separate supervised task that needs ground-truth durations. FastSpeech extracts them from the attention alignments of an autoregressive teacher model, adding a dependency on the very attention system you are trying to replace; FastSpeech 2 takes them from an external forced aligner instead.
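The length regulator is simple enough to show in a few lines. This sketch repeats phoneme labels for readability; in an actual model the repeated items would be encoder hidden vectors and the durations would come from the duration predictor. The function name is an assumption for this example.

```python
def length_regulate(phonemes, durations):
    """Repeat each phoneme by its duration (in mel frames).

    phonemes:  sequence of per-phoneme items (labels here, hidden vectors
               in a real model).
    durations: one non-negative integer frame count per phoneme.
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    expanded = []
    for phoneme, frames in zip(phonemes, durations):
        expanded.extend([phoneme] * max(0, int(frames)))
    return expanded

print(length_regulate(["HH", "AH", "L", "OW"], [2, 3, 1, 4]))
# ['HH', 'HH', 'AH', 'AH', 'AH', 'L', 'OW', 'OW', 'OW', 'OW']
```

Because every phoneme is expanded explicitly, skipping can only happen if a predicted duration is zero, and repetition only if a duration is far too large, both of which are easy to bound.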

Codec language models (VALL-E and successors) shift the problem: they generate discrete codec tokens with a language model conditioned on the phoneme sequence and a short acoustic prompt, with no explicit text-to-frame alignment. Attention failures still occur but manifest differently - as token repetition in the codec sequence rather than as phoneme skipping.

When it falls down

Even with location-sensitive attention, guided loss, and careful training, failures persist in these scenarios:

  • Long inputs. Sentences beyond roughly 100 characters (or 50 phonemes) stress attention robustness, because the model must maintain monotonic progress over more steps than it typically saw during training. The failure rate grows faster than linearly with input length.
  • Number and abbreviation normalisation errors. If the text normaliser fails to expand "Dr. Smith will see you at 3pm on 12/4" correctly, the resulting phoneme sequence has unusual structure that the encoder has rarely seen. The attention has no defence against encoder representations for unseen symbol sequences.
  • Very fast or very slow target speaking rates. Prosody conditioning (pitch and rate tokens, style embeddings) shifts the number of frames per phoneme. If the guidance pushes rate outside the training distribution, the decoder expects more or fewer frames per encoder step than the attention was trained to produce.
  • Voice cloning from limited data. Adapting a model to a new speaker on a few minutes of data often produces a model that has overfit the speaker style but underfit the alignment mechanism. Attention failures are disproportionately common in low-resource cloning scenarios.
  • Unreliable stop-token prediction. Tacotron-style models must learn when to stop generating, and the stop prediction is itself just another learned signal. If it is uncertain, the decoder may continue past the end of the text, looping on silence or on the last phoneme.
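The runaway-decoder case in the last bullet is usually contained with a hard cap on output length in addition to the learned stop signal. The sketch below shows one way to combine the two; the function names, the 0.5 threshold and the cap of 20 frames per input token are hypothetical values that would need tuning to the frame rate and speaking style of the data.

```python
def should_stop(stop_prob, step, num_tokens, threshold=0.5, max_frames_per_token=20):
    """Return a reason string if decoding should end at this step, else None.

    stop_prob:  the model's stop-token probability for this frame.
    step:       zero-based index of the frame just generated.
    num_tokens: number of input characters or phonemes.
    """
    if stop_prob >= threshold:
        return "stop_token"
    if step + 1 >= num_tokens * max_frames_per_token:
        return "length_cap"
    return None

def decode_length(stop_probs, num_tokens):
    """Walk a sequence of stop probabilities and report where decoding ends."""
    for step, prob in enumerate(stop_probs):
        reason = should_stop(prob, step, num_tokens)
        if reason:
            return step + 1, reason
    return len(stop_probs), "exhausted"

print(decode_length([0.1, 0.2, 0.9, 0.1], num_tokens=10))  # (3, 'stop_token')
print(decode_length([0.1] * 1000, num_tokens=10))          # (200, 'length_cap')
```

Logging how often the length cap fires, rather than the stop token, is a cheap way to monitor this failure mode in production.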
