The TTS Problem and Pipeline

Text-to-speech converts a string of characters into a waveform through a chain of normalisation, acoustic modelling, and synthesis stages, each introducing its own failure modes.

beginner · 7 min read

A professional human voice actor reading a novel produces about 150 words per minute. Google's Tacotron 2, which pairs a sequence-to-sequence acoustic model with a WaveNet vocoder, reported a mean opinion score (MOS) of 4.53 on a 5-point naturalness scale, within 0.05 points of the score given to real recordings in the same evaluation. Getting there required solving at least four quite different engineering problems in sequence - and the sequence matters, because errors made in an early stage are carried into every later one.

What the pipeline actually does

TTS is not one model; it is a cascade. A character string enters one end and a waveform exits the other. The stages in between are:

| Stage | Input | Output | Main challenge |
| --- | --- | --- | --- |
| Text normalisation | Raw text | Spoken-form tokens | Ambiguity, abbreviations, heteronyms |
| Grapheme-to-phoneme (G2P) | Tokens | Phoneme sequence | Out-of-vocabulary words, proper nouns |
| Prosody / duration prediction | Phonemes | Timed phoneme sequence | Rhythm, stress, phrasing |
| Acoustic model | Timed phonemes | Acoustic features (mel spectrogram) | Naturalness, expressiveness |
| Vocoder / codec | Mel spectrogram | Raw waveform (~22-44 kHz sample rate) | Audio fidelity, artefacts |

Each stage can be a separate trained model, or several stages can be folded into one end-to-end model (the trend since 2018). But even in an end-to-end system, the conceptual boundaries remain because the failure modes align with them.

Text normalisation: the part everyone underestimates

A TTS system receives whatever text a product engineer feeds it. That text is almost never clean. Consider:

"The CEO earned $1.3M in FY22, up from $980K the year prior."

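How should a system say $1.3M? "One point three million dollars"? "One dollar three M"? What about FY22? "Financial year twenty-two" or "F Y 2022"? And a date written 06/08/26 means June 8 to a US reader but 6 August to a UK reader, so the correct reading depends on locale information the text itself does not carry.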

Normalisation is typically implemented as a finite-state transducer (FST) grammar or a small seq2seq model that converts written-form tokens into spoken-form tokens before any phoneme lookup happens. The failure here is quiet: a miscategorised abbreviation or number format produces an intelligible but wrong pronunciation, and users often blame the voice rather than the data pipeline.
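
As a rough illustration of what one such rule looks like, here is a minimal Python sketch that rewrites the currency amounts from the example sentence into spoken form. The regular expression, the `normalise` function, and the choice of US-style number words are all assumptions made for this example; a production grammar would cover far more cases (cents, other currencies, larger numbers).

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
MAGNITUDES = {"K": "thousand", "M": "million", "B": "billion"}

# Hypothetical pattern: "$", up to three digits, optional decimals, optional K/M/B suffix.
CURRENCY = re.compile(r"\$(\d{1,3})(?:\.(\d+))?([KMB])?\b")


def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + int_to_words(rest) if rest else "")


def speak_currency(match: re.Match) -> str:
    """Turn one matched amount into words; decimals are read digit by digit."""
    whole, frac, magnitude = match.groups()
    words = int_to_words(int(whole))
    if frac:
        words += " point " + " ".join(ONES[int(d)] for d in frac)
    if magnitude:
        words += " " + MAGNITUDES[magnitude]
    singular = whole == "1" and not frac and not magnitude
    return f"{words} {'dollar' if singular else 'dollars'}"


def normalise(text: str) -> str:
    """Apply the single currency rule; everything else passes through untouched."""
    return CURRENCY.sub(speak_currency, text)


print(normalise("The CEO earned $1.3M in FY22, up from $980K the year prior."))
```

Note that FY22 passes through unchanged: each token class needs its own rule, which is why coverage gaps tend to surface one format at a time.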

Heteronyms are a harder problem. The string "read" is pronounced differently in "I will read this" versus "I have read this." Both uses are verbs, so a coarse part-of-speech tag is not enough: resolving the pronunciation requires tense-level context (base form versus past participle), which in turn requires at least a shallow parse. Most production systems handle a curated list of heteronyms with prioritised rules; the tail is long.
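
A curated heteronym rule can be as simple as looking for a cue word to the left. The sketch below handles only "read", using hypothetical cue lists and ARPAbet-style phoneme strings chosen for illustration; it stands in for the shallow parse a real system would use.

```python
# Hypothetical cue words; a real system would rely on a tagger or parser instead.
PAST_CUES = {"have", "has", "had", "was", "were", "been"}
PRESENT_CUES = {"will", "to", "can", "shall", "must", "please"}


def pronounce_read(tokens: list[str], index: int) -> str:
    """Choose a pronunciation for tokens[index] == 'read' from the nearest cue to its left."""
    for previous in reversed(tokens[:index]):
        word = previous.lower()
        if word in PAST_CUES:
            return "R EH1 D"  # rhymes with "red"
        if word in PRESENT_CUES:
            return "R IY1 D"  # rhymes with "reed"
    return "R IY1 D"  # fallback when no cue is found (an arbitrary default)


print(pronounce_read("I will read this".split(), 2))
print(pronounce_read("I have read this".split(), 2))
```

A sentence such as "I read it yesterday" has no cue word at all and falls through to the default, which is exactly the long tail the rule list cannot cover.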

Acoustic models: from phonemes to spectrograms

The acoustic model is where most of the last decade's research energy went. The breakthrough architecture was the sequence-to-sequence model with attention (Tacotron, 2017), followed by Tacotron 2 (2018), which conditions a modified WaveNet vocoder on mel-spectrogram predictions and achieves that near-human MOS of 4.53.

The acoustic model takes a sequence of phoneme (or character) tokens and outputs a mel spectrogram - a time-frequency representation where the y-axis is perceptual frequency (mel scale) and the x-axis is time in frames. A typical 22,050 Hz system uses 80 mel bins and 256-sample hop size, giving roughly 86 frames per second.
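
The frame-rate figure follows directly from the sample rate and hop size. The short sketch below reproduces that arithmetic and adds one common Hz-to-mel conversion (the 2595 * log10(1 + f/700) form); the function names are illustrative, and the frame count ignores padding conventions, which differ between toolkits.

```python
import math

SAMPLE_RATE = 22_050  # samples per second
HOP_SIZE = 256        # samples between consecutive frames
N_MELS = 80           # mel bins per frame


def frames_per_second(sample_rate: int = SAMPLE_RATE, hop: int = HOP_SIZE) -> float:
    """Each frame advances by `hop` samples, so the frame rate is sample_rate / hop."""
    return sample_rate / hop


def num_frames(duration_s: float, sample_rate: int = SAMPLE_RATE, hop: int = HOP_SIZE) -> int:
    """Approximate frame count for a clip (assumes no extra padding frames)."""
    return math.ceil(duration_s * sample_rate / hop)


def hz_to_mel(freq_hz: float) -> float:
    """One widely used mel-scale formula; other variants exist."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


print(round(frames_per_second(), 2))   # about 86 frames per second
print(num_frames(10.0) * N_MELS)       # values in the spectrogram of a 10 s clip
print(round(hz_to_mel(8000.0), 1))     # 8 kHz expressed on the mel scale
```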

The key design decisions here are:

  • Attention mechanism: how the model learns which input phonemes correspond to which output frames. Hard monotonic attention works well for TTS because speech is approximately left-to-right; unconstrained attention can "skip" or "repeat" phonemes.
  • Duration modelling: non-autoregressive models (FastSpeech, 2019) predict explicit phoneme durations, which removes the attention failure mode. FastSpeech reported roughly a 38x speed-up in end-to-end synthesis over an autoregressive Transformer TTS baseline at comparable quality.
  • Conditioning: speaker identity, speaking style, or emotional tag can be provided as an embedding, enabling multi-speaker and expressive TTS without separate model copies.
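
The duration-modelling idea in the list above can be sketched as a "length regulator": each phoneme is repeated for as many frames as its predicted duration. In a real model the repeated items are hidden-state vectors and the durations come from a trained predictor; here plain phoneme labels and hand-picked durations stand in for both, purely as an assumption for illustration.

```python
def length_regulate(phonemes: list[str], durations: list[int]) -> list[str]:
    """Expand a phoneme sequence to frame level by repeating each item `duration` times."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: list[str] = []
    for phoneme, duration in zip(phonemes, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phoneme] * duration)
    return frames


# Hand-picked durations in frames (hypothetical values, not model output).
frames = length_regulate(["HH", "AH", "L", "OW"], [3, 5, 4, 8])
print(len(frames))              # total number of output frames
print(len(frames) * 256 / 22_050)  # duration in seconds at hop 256, 22,050 Hz
```

Because every output frame is tied to exactly one input phoneme, the alignment cannot skip or repeat a phoneme the way a learned attention map can.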

Vocoders and neural audio codecs

The acoustic model's mel spectrogram is not audio. A vocoder converts it into a waveform. Two families matter:

Neural vocoders (WaveNet, WaveGlow, HiFi-GAN, WaveGrad) directly synthesise samples. WaveNet is autoregressive - it conditions each sample on the previous samples within its receptive field - giving high quality but slow inference (slower than real time on CPU without distillation or similar acceleration). HiFi-GAN, a non-autoregressive generative adversarial network, and WaveGrad, a diffusion model, generate samples in parallel and recover near-WaveNet quality at far lower latency.

Neural audio codecs (SoundStream, EnCodec) take a different approach: they learn a compressed discrete representation of audio using a residual vector quantiser (RVQ), in which each quantiser stage encodes the error left over by the stages before it. SoundStream demonstrated that a fully convolutional encoder-decoder with RVQ can compress speech to 3 kbps while outperforming the classical Opus codec at 12 kbps. EnCodec (Meta, 2022) extended this to 24 kHz mono and 48 kHz stereo with a multi-scale STFT discriminator and a loss balancer that stabilises training.
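
Residual vector quantisation is easier to see in code than in prose. The sketch below is a toy version with two tiny hand-written codebooks over 2-dimensional vectors; real codecs learn their codebooks and quantise encoder outputs with many more dimensions. All names and values are illustrative assumptions.

```python
Vector = list[float]
Codebook = list[Vector]


def nearest(codebook: Codebook, vec: Vector) -> int:
    """Index of the codebook entry closest to `vec` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec: Vector, codebooks: list[Codebook]) -> list[int]:
    """Quantise in stages: each stage encodes the residual left by the previous ones."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        indices.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return indices


def rvq_decode(indices: list[int], codebooks: list[Codebook]) -> Vector:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Toy codebooks: a coarse first stage and a finer second stage.
COARSE = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
FINE = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]]

codes = rvq_encode([1.2, 0.9], [COARSE, FINE])
print(codes, rvq_decode(codes, [COARSE, FINE]))
```

Dropping later stages at decode time still yields a usable, coarser reconstruction, which is how a single codec can serve several bitrates.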

These codecs matter for TTS because they enable a third generation of synthesis: codec language models. VALL-E (Microsoft, 2023) reframes TTS as conditional language modelling over EnCodec tokens. Given a 3-second recording of a speaker's voice plus a text prompt, VALL-E generates the corresponding EnCodec token sequence, which the codec then decodes to a waveform. This zero-shot voice cloning capability was not feasible with waveform-domain vocoders at the same quality level.

The conceptual shift is significant:

Classical pipeline:   text -> phonemes -> mel spectrogram -> waveform
Codec LM pipeline:    text + speaker tokens -> codec tokens -> waveform

The acoustic model and vocoder collapse into one conditional language model trained on quantised audio. The normalisation and G2P stages still exist upstream, though VALL-E's training on 60,000 hours of English speech gives it some implicit tolerance for orthographic variation.
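
To get a feel for the sequence lengths a codec language model has to handle, the arithmetic below relates frame rate, number of codebooks, and codebook size to bitrate and token count. The specific values (75 frames per second, 8 codebooks of 1,024 entries) are assumptions based on commonly cited settings for a 24 kHz EnCodec configuration, not figures taken from this article.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Each frame emits one index per codebook; each index costs log2(codebook_size) bits."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


def tokens_for(duration_s: float, frame_rate_hz: float, n_codebooks: int) -> int:
    """Total discrete tokens needed to represent a clip of the given duration."""
    return math.ceil(duration_s * frame_rate_hz) * n_codebooks


# Assumed configuration (see the note above).
FRAME_RATE_HZ = 75
N_CODEBOOKS = 8
CODEBOOK_SIZE = 1024

print(codec_bitrate_bps(FRAME_RATE_HZ, N_CODEBOOKS, CODEBOOK_SIZE))  # bits per second
print(tokens_for(3.0, FRAME_RATE_HZ, N_CODEBOOKS))   # tokens in a 3-second speaker prompt
print(tokens_for(10.0, FRAME_RATE_HZ, N_CODEBOOKS))  # tokens to generate for 10 s of speech
```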

Voice cloning and speaker adaptation

Zero-shot voice cloning (as in VALL-E) requires only a short reference clip and works without fine-tuning. The speaker's voice identity is carried by a prompt (in VALL-E, the codec tokens of the reference clip) that conditions the codec token generation.

Speaker adaptation (fine-tuning an existing model on 5-30 minutes of a target speaker's audio) was the dominant approach before 2023. It gives more reliable reproduction of unusual voice characteristics - regional accent, laryngeal quality, speaking rate - because it bakes speaker-specific priors into model weights rather than relying on prompt conditioning to generalise.

The two approaches are not mutually exclusive: a zero-shot model can be fine-tuned for production deployment of a known speaker, trading flexibility for fidelity.

When it falls down

Even with near-human MOS at the benchmark sentence level, TTS systems fail in predictable ways:

  • Long sentences with complex prosody. A model trained on read speech degrades on spontaneous-style text with embedded lists, hedges, and restarts. The acoustic model's attention or duration predictor was not exposed to this distribution.
  • Rare proper nouns and code-switched text. "Nguyen" and "Ptolemy" expose G2P failures instantly. Mixing languages ("please open the fichier") confuses phoneme lookup unless the normaliser language-tags each segment.
  • Numeric and symbolic edge cases. Telephone numbers, IP addresses, mathematical expressions, and chemical formulae all require bespoke normalisation rules. A generic rule set will mispronounce 192.168.1.1 as "one hundred ninety-two point one hundred sixty-eight..."
  • Expressiveness and paralinguistics. Laughter, hesitation, emphasis, and whispering are outside the distribution of most read-speech training corpora. Models may fail to produce them at all or produce only a bland approximation.
  • Codec artefacts at low bitrate. Near the low end of the codec's bitrate range, where few quantiser stages are available, codec LM outputs can exhibit metallic ringing or muffled transients, particularly on fricatives and plosives.
  • Prosody in emotionally loaded text. A sentence like "he was, of course, not guilty" can carry sarcasm, genuine relief, or flat reportage. Without explicit emotion conditioning, the model picks whatever was most frequent in training.
  • Hallucinated phonemes. Autoregressive acoustic models and codec LMs can occasionally insert or repeat a word, particularly at sentence boundaries where the attention or language model loses coherence.
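
The numeric edge case in the list above is a good example of a bespoke rule. The sketch below detects dotted-quad IPv4 addresses and reads them digit by digit with "dot" between groups. The pattern and function names are assumptions for illustration, and the pattern does not check that each group is at most 255.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

# Four groups of one to three digits separated by dots.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def speak_ipv4(match: re.Match) -> str:
    """Read each group digit by digit and join the groups with 'dot'."""
    groups = match.group(0).split(".")
    return " dot ".join(" ".join(DIGITS[int(d)] for d in group) for group in groups)


def normalise_ipv4(text: str) -> str:
    """Rewrite every IPv4-looking token; all other text passes through unchanged."""
    return IPV4.sub(speak_ipv4, text)


print(normalise_ipv4("Connect to 192.168.1.1 now."))
# Connect to one nine two dot one six eight dot one dot one now.
```

This rule has to run before any generic decimal-number rule, otherwise "192.168" is claimed as a decimal first; rule ordering is part of the normaliser's design.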
