Duration Modelling and Alignment
Duration modelling assigns how many audio frames each phoneme occupies; alignment is the mechanism that learns or infers that mapping from text-audio pairs without manual annotation.
A single word like "stretched" can last 300 ms or 900 ms depending on speaker, emotion, and surrounding context. Every neural TTS system must answer the same question before it can synthesise anything: which output frames belong to which input token? That mapping is duration modelling and alignment, and it is the structural spine of modern speech synthesis.
The alignment problem
Text-to-speech is fundamentally a sequence-to-sequence task with a severe length mismatch. A five-phoneme word maps to anywhere from 20 to 200 mel-spectrogram frames at 22 kHz with a 256-sample hop. Early systems inherited attention from neural machine translation, letting the decoder attend over encoder states while generating frames one at a time.
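To make the length mismatch concrete, the sketch below converts the 300 ms and 900 ms figures from the introduction into frame counts. It assumes "22 kHz" means the common 22,050 Hz sampling rate; `frame_ms` and `frames_for` are hypothetical helper names used only for illustration.

```python
SAMPLE_RATE = 22_050  # assumption: "22 kHz" means 22,050 Hz
HOP_LENGTH = 256      # samples between consecutive mel frames


def frame_ms(sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Duration of one mel frame in milliseconds."""
    return 1000.0 * hop_length / sample_rate


def frames_for(duration_ms):
    """Approximate number of mel frames covering duration_ms of audio."""
    return round(duration_ms / frame_ms())


print(f"{frame_ms():.2f} ms per frame")
print(frames_for(300), frames_for(900))
```

Under these assumptions one frame covers about 11.6 ms, so the same word spans roughly 26 frames when spoken in 300 ms and roughly 78 frames when spoken in 900 ms.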
Tacotron 2 (2018) used location-sensitive attention, which feeds the cumulative attention weights from earlier steps back into the attention computation, nudging the model to step forward through the phoneme sequence. In practice this still broke on long sentences: the attention would sometimes loop over a phoneme (producing stutters) or skip one entirely (producing deletions). Both are catastrophic in production TTS: listeners will forgive an occasional mispronunciation, but not repeated or missing words.
The core constraint that saves alignment is monotonicity: a speaker always says phoneme 1 before phoneme 2, never backwards. Whether that constraint is enforced as a hard rule, and whether at training or inference time, is the main distinction between alignment methods.
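As a minimal sketch of what monotonicity means for a hard alignment, the check below takes a list that maps each frame to a token index and verifies that the path starts on the first token, ends on the last, and never moves backwards or skips a token. The function name and the list representation are assumptions made for illustration.

```python
def is_monotonic(frame_to_token, num_tokens):
    """Return True if a hard alignment is monotonic and skips no token.

    frame_to_token[j] is the index of the text token assigned to frame j.
    """
    if not frame_to_token:
        return False
    if frame_to_token[0] != 0 or frame_to_token[-1] != num_tokens - 1:
        return False  # must start on the first token and end on the last
    steps = [b - a for a, b in zip(frame_to_token, frame_to_token[1:])]
    # Each step either stays on the same token (0) or advances to the next (1).
    return all(step in (0, 1) for step in steps)


print(is_monotonic([0, 0, 1, 2, 2], num_tokens=3))  # valid path
print(is_monotonic([0, 1, 0, 2, 2], num_tokens=3))  # goes backwards (a stutter)
print(is_monotonic([0, 0, 2, 2, 2], num_tokens=3))  # skips token 1 (a deletion)
```

The second and third calls correspond to the stutter and deletion failures described for attention-based models.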
Attention-based vs. duration-predictor approaches
There are two broad families, and the duration-predictor family splits further by where its alignment comes from:
| Approach | How alignment is obtained | Inference speed | Robustness |
|---|---|---|---|
| Soft attention (Tacotron) | Learned implicitly, frame-by-frame | Autoregressive (slow) | Fragile on long inputs |
| Hard monotonic alignment (Glow-TTS, VITS) | Dynamic programming over the model's own likelihoods during training | Parallel (fast) | High |
| External aligner + duration predictor | MFA/CTC forced alignment offline | Parallel (fast) | High |
Soft attention treats alignment as a continuous distribution over encoder states at each decoder step. It is expressive, but nothing in it enforces monotonicity, and the model tends to violate the constraint on inputs at the tails of the training distribution, such as unusually long sentences.
Hard monotonic alignment uses Viterbi-style dynamic programming (a max over paths, rather than the sum over paths computed by the HMM forward algorithm) to find the most probable monotone path through a log-likelihood matrix between text and latent speech. Glow-TTS (Kim et al., NeurIPS 2020) built this directly into its training loop with Monotonic Alignment Search (MAS): at each training step the highest-likelihood monotonic path through the log-likelihood matrix is computed with dynamic programming, and the number of frames the path spends on each phoneme gives the per-phoneme durations that supervise a separate duration predictor network.
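The forward algorithm and a Viterbi-style search share the same recurrence over monotonic paths and differ only in how two incoming paths are combined: the forward algorithm sums them, while a best-path search such as MAS keeps the larger one. The sketch below illustrates that difference on a toy matrix; `path_score` is a hypothetical helper, not code from any of the cited systems.

```python
import numpy as np


def path_score(Q, combine):
    """Run the monotonic-path recurrence over Q with a pluggable combiner.

    combine=np.maximum keeps the best path (Viterbi-style search);
    combine=np.logaddexp sums over all paths (forward algorithm).
    """
    T_text, T_mel = Q.shape
    score = np.full((T_text, T_mel), -np.inf)
    score[0, 0] = Q[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = score[i, j - 1]
            advance = score[i - 1, j - 1] if i > 0 else -np.inf
            score[i, j] = Q[i, j] + combine(stay, advance)
    return score[-1, -1]


Q = np.zeros((3, 5))  # toy matrix: every cell has log-probability 0
print(path_score(Q, np.maximum))            # log-score of the best single path
print(np.exp(path_score(Q, np.logaddexp)))  # total score over all monotonic paths
```

With every cell set to log-probability 0, each path scores 1, so the best-path score is 0.0 in the log domain and the summed score is about 6, the number of monotonic paths through a 3-by-5 grid.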
External forced alignment uses a separate model, typically the Montreal Forced Aligner (MFA) built on Kaldi, to annotate every training utterance with phoneme boundaries before TTS training even starts. FastSpeech 2 (Ren et al., ICLR 2021) took this route. The duration predictor is then a lightweight 2-layer 1D convolutional network that regresses log-duration from encoder hidden states. At inference, the predictor outputs a duration per phoneme and the encoder sequence is upsampled by repeating each state the predicted number of times, giving a frame-aligned sequence that a decoder converts directly to a mel-spectrogram.
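Forced aligners typically report phoneme boundaries in seconds, so they have to be converted into integer frame counts before they can supervise a duration predictor. The sketch below shows one way to do that; the interval values are made up for illustration, the sampling rate and hop length are assumptions, and `intervals_to_durations` is a hypothetical helper rather than part of MFA.

```python
SAMPLE_RATE = 22_050  # assumption: 22,050 Hz audio
HOP_LENGTH = 256      # assumption: same hop as the mel-spectrogram


def intervals_to_durations(intervals, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Convert (phoneme, start_s, end_s) intervals to integer frame counts.

    Boundaries are rounded to frame indices first, so the durations always
    sum to the frame index of the final boundary.
    """
    def to_frame(t):
        return int(round(t * sample_rate / hop_length))

    return [(ph, to_frame(end) - to_frame(start)) for ph, start, end in intervals]


# Made-up boundaries for illustration, not real aligner output.
intervals = [("HH", 0.00, 0.08), ("AH", 0.08, 0.085), ("L", 0.085, 0.21), ("OW", 0.21, 0.45)]
durations = intervals_to_durations(intervals)
print(durations)
print(sum(d for _, d in durations))
```

Rounding the boundaries rather than the individual durations keeps the total frame count consistent, at the cost of occasionally giving a very short phoneme zero frames, as happens to the 5 ms "AH" here.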
How a duration predictor is trained
In FastSpeech 2 the duration predictor loss is mean-squared error on log-durations (working in the log domain means the loss tracks relative rather than absolute error, so long phonemes do not dominate it):
L_dur = MSE( log(d_pred + 1), log(d_gt + 1) )
d_gt comes from the forced aligner. The +1 prevents log(0) for phones with zero-frame duration, which happens occasionally with reduced vowels.
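A minimal NumPy sketch of this loss, written directly from the formula above; the duration values are made up for illustration.

```python
import numpy as np


def duration_loss(d_pred, d_gt):
    """MSE between log(d_pred + 1) and log(d_gt + 1)."""
    d_pred = np.asarray(d_pred, dtype=float)
    d_gt = np.asarray(d_gt, dtype=float)
    return float(np.mean((np.log1p(d_pred) - np.log1p(d_gt)) ** 2))


d_gt = [7, 0, 11, 21]             # frames per phoneme; made-up values
print(duration_loss(d_gt, d_gt))  # a perfect prediction gives 0.0
# The same 3-frame error costs more on a short phoneme than on a long one.
print(duration_loss([4], [1]) > duration_loss([24], [21]))
```

`np.log1p(x)` computes log(x + 1), so a zero-frame target is handled without a special case.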
At inference the predictor outputs a real number; it is rounded to the nearest integer and clipped to at least 1 to avoid zero-length outputs. A length regulator then repeats each encoder hidden state d_i times:
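# length regulator (Python sketch using NumPy)
import numpy as np

def length_regulate(encoder_states, durations):
    # Repeat each encoder state h_i for its d_i frames, in order.
    expanded = []
    for h_i, d_i in zip(encoder_states, durations):
        expanded.extend([h_i] * int(d_i))
    return np.stack(expanded)  # shape: [sum(d), hidden_dim]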
This is differentiable with respect to the encoder states (the repetition is a gather, not a sort), but not with respect to the durations themselves because rounding breaks the gradient. Duration is therefore trained as an auxiliary head with its own loss, not through the main reconstruction path.
VITS (Kim et al., ICML 2021) went further: it replaces the deterministic duration predictor with a stochastic duration predictor that models a distribution over durations using normalising flows. The stochastic predictor generates diverse rhythms from the same text, which is crucial for conversational and expressive voices where the same sentence can be spoken with genuinely different timing.
Monotonic Alignment Search in detail
MAS, as introduced in Glow-TTS, operates on a log-likelihood matrix Q of shape [T_text, T_mel], where Q[i, j] is the log-likelihood of latent frame j (the flow's encoding of mel frame j) under the prior distribution predicted for text token i. The optimal monotone alignment is found by:
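# dynamic programming (simplified; Q, T_text, T_mel as defined above)
import numpy as np

Q_path = np.full((T_text, T_mel), -np.inf)
Q_path[0, 0] = Q[0, 0]  # the path starts at the first token and first frame
for j in range(1, T_mel):
    for i in range(min(j + 1, T_text)):  # token i cannot be reached before frame i
        advanced = Q_path[i - 1, j - 1] if i > 0 else -np.inf
        Q_path[i, j] = Q[i, j] + max(
            advanced,          # moved on from the previous token at the previous frame
            Q_path[i, j - 1],  # stayed on the same token since the previous frame
        )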
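The recurrence can be made runnable and extended with a back-trace that reads per-token durations off the best path. The version below is a sketch under the assumption that there are at least as many frames as tokens; the function name and the toy matrix are illustrative and are not taken from the Glow-TTS code.

```python
import numpy as np


def monotonic_alignment_search(Q):
    """Return per-token durations for the best monotonic path through Q.

    Q has shape [T_text, T_mel]; Q[i, j] scores frame j under token i.
    Assumes T_mel >= T_text so that every token gets at least one frame.
    """
    T_text, T_mel = Q.shape
    best = np.full((T_text, T_mel), -np.inf)
    best[0, 0] = Q[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = best[i, j - 1]
            advance = best[i - 1, j - 1] if i > 0 else -np.inf
            best[i, j] = Q[i, j] + max(stay, advance)

    # Back-trace from the last token at the last frame to the first of each.
    durations = np.zeros(T_text, dtype=int)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        durations[i] += 1
        # Step back to the previous token if that was the better predecessor,
        # or if token i cannot have been active any earlier (i == j).
        if i > 0 and (i == j or best[i - 1, j - 1] >= best[i, j - 1]):
            i -= 1
    return durations


# Toy likelihoods: token 0 fits frames 0-1, token 1 frame 2, token 2 frames 3-4.
Q = np.log(np.array([
    [0.8, 0.7, 0.1, 0.1, 0.1],
    [0.1, 0.2, 0.8, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.8, 0.9],
]))
print(monotonic_alignment_search(Q))  # durations per token: [2 1 2]
```

The durations always sum to the number of frames, which is what lets them serve directly as targets for a duration predictor.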
The recurrence above is O(T_text * T_mel) per utterance, which is fast enough for typical utterance lengths but becomes expensive for very long inputs (audiobook-length sentences). The path is then back-traced to extract per-token durations, which supervise the duration predictor as a by-product. Crucially, MAS requires no external aligner: alignment and the acoustic model co-train from scratch.
"One TTS Alignment To Rule Them All" (Badlani et al., 2021, ICASSP 2022) takes a related approach that is not tied to one model: it combines a forward-sum alignment objective with Viterbi decoding and a static alignment prior, showing the framework could be grafted onto Tacotron 2, FastPitch, and FastSpeech 2 without architecture changes.
When it falls down
Short or reduced phonemes. Schwa vowels and stop bursts can last less than 5 ms, while a 256-sample hop at 22 kHz gives about 11.6 ms per frame, so such segments are shorter than a single frame. The duration predictor often collapses these to 1 frame; if forced alignment assigned 0, the clip-to-1 heuristic introduces a systematic lengthening bias on fast speech.
Highly expressive or singing voice. Duration statistics shift dramatically between neutral read speech and emotional or sung output. A predictor trained on audiobook data will mistime disfluencies, long holds, and glottal fry. Fine-tuning on target-style data is usually mandatory, not optional.
Out-of-vocabulary or rare grapheme clusters. If the text normaliser passes an unusual token (an abbreviation, a numeral not covered by the normaliser), the encoder representation is noisy and the duration predictor has no reliable signal to latch onto. The result is often an ultra-short or ultra-long segment that sounds like a glitch.
Alignment drift in MAS during early training. Because MAS uses the current model's likelihood to compute the alignment, and the model starts with random weights, early alignments are essentially arbitrary. Training can converge to a degenerate solution (all frames assigned to one token) if the model is not warm-started carefully or if the alignment cost is not well-conditioned. Diagonal alignment priors (for example, a Gaussian prior encouraging the alignment to stay near the diagonal) are a practical fix.
Long-form synthesis. Both attention-based systems and MAS become less reliable beyond roughly 200 phonemes per utterance. Most production pipelines segment text into sentences before synthesis for this reason.
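The diagonal prior mentioned under alignment drift can be sketched in a few lines. The version below is a minimal illustration, assuming token and frame positions are normalised to the unit interval; `diagonal_prior` and its `sigma` width are hypothetical and would need tuning in practice.

```python
import numpy as np


def diagonal_prior(T_text, T_mel, sigma=0.2):
    """Gaussian log-prior that peaks where token and frame positions match.

    sigma is a hypothetical width in normalised position units.
    """
    token_pos = (np.arange(T_text) + 0.5) / T_text   # [T_text], in (0, 1)
    frame_pos = (np.arange(T_mel) + 0.5) / T_mel     # [T_mel], in (0, 1)
    diff = token_pos[:, None] - frame_pos[None, :]   # [T_text, T_mel]
    return -(diff ** 2) / (2.0 * sigma ** 2)


log_prior = diagonal_prior(T_text=4, T_mel=12)
# Adding log_prior to the log-likelihood matrix before the alignment search
# penalises paths that stray far from the diagonal.
print(log_prior.argmax(axis=0))  # most favoured token for each frame
```

For 4 tokens and 12 frames the prior favours an even split of three frames per token, which gives the search a sensible starting point while the model's own likelihoods are still uninformative.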
Further reading
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (Ren et al., ICLR 2021)
- Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (Kim et al., NeurIPS 2020)
- One TTS Alignment To Rule Them All (Badlani et al., ICASSP 2022)
- Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech / VITS (Kim et al., ICML 2021)