The TTS Problem and Pipeline

A professional human voice actor reading a novel produces about 150 words per minute. A modern neural TTS system such as Google's Tacotron 2, which pairs a spectrogram predictor with a WaveNet vocoder, achieves a mean opinion score (MOS) of 4.53 on a 5-point naturalness scale, within 0.05 points of the 4.58 scored by professional recordings. Getting there required solving several quite different engineering problems in sequence - and the sequence matters, because errors compound: a mistake made in an early stage is passed downstream and cannot be repaired later.

What the pipeline actually does

TTS is not one model; it is a cascade. A character string enters one end and a waveform exits the other. The stages in between are:

Stage	Input	Output	Main challenge
Text normalisation	Raw text	Spoken-form tokens	Ambiguity, abbreviations, heteronyms
Grapheme-to-phoneme (G2P)	Tokens	Phoneme sequence	Out-of-vocabulary words, proper nouns
Prosody / duration prediction	Phonemes	Timed phoneme sequence	Rhythm, stress, phrasing
Acoustic model	Timed phonemes	Acoustic features (mel spectrogram)	Naturalness, expressiveness
Vocoder / codec	Mel spectrogram	Raw waveform at ~22-44 kHz	Audio fidelity, artefacts

Each stage can be a separate trained model, or several stages can be folded into one end-to-end model (the trend since 2018). But even in an end-to-end system, the conceptual boundaries remain because the failure modes align with them.

Text normalisation: the part everyone underestimates

A TTS system receives whatever text a product engineer feeds it. That text is almost never clean. Consider:

"The CEO earned $1.3M in FY22, up from $980K the year prior."

How should a system say $1.3M? "One point three million dollars"? "One dollar three M"? What about FY22? "Financial year twenty-two" or "F Y 2022"? And what about a date written 06/08/26, which means June 8 in the US but 6 August in the UK?

Normalisation rules are typically a finite-state transducer (FST) grammar or a small seq2seq model that converts written-form tokens into spoken-form tokens before any phoneme lookup happens. The failure here is quiet: a miscategorised abbreviation or number format produces an intelligible but wrong pronunciation, and users often blame the voice rather than the data pipeline.
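A minimal sketch of one such written-to-spoken rule, using a regular expression in place of a full FST grammar. The function names and the choice to cover only the "$<number><K|M|B>" pattern are assumptions made for illustration; a production grammar would cover many more formats and locales.

```python
import re

# Hypothetical rule set: handles only "$<1-3 digits>[.<digits>][K|M|B]".
_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]
_SCALE = {"K": "thousand", "M": "million", "B": "billion"}

CURRENCY = re.compile(r"\$(\d{1,3})(?:\.(\d+))?([KMB])?\b")


def spell_int(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("-" + _ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return _ONES[hundreds] + " hundred" + (" " + spell_int(rest) if rest else "")


def spell_amount(match: re.Match) -> str:
    """Turn one regex match into its spoken form."""
    whole, frac, suffix = match.group(1), match.group(2), match.group(3)
    words = spell_int(int(whole))
    if frac:
        # Digits after the decimal point are read one at a time.
        words += " point " + " ".join(_ONES[int(d)] for d in frac)
    if suffix:
        words += " " + _SCALE[suffix]
    return words + " dollars"


def normalise_currency(text: str) -> str:
    """Rewrite matching currency amounts; leave everything else untouched."""
    return CURRENCY.sub(spell_amount, text)


if __name__ == "__main__":
    print(normalise_currency(
        "The CEO earned $1.3M in FY22, up from $980K the year prior."))
```

Note that FY22 passes through unchanged: this rule does not recognise it, so it would need a rule of its own. An amount the pattern does not cover, such as $1234, is also left as written rather than guessed at.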

Heteronyms are a harder problem. The string "read" is pronounced differently in "I will read this" (rhyming with "need") and "I have read this" (rhyming with "bed"). Resolving this requires part-of-speech context, which in turn requires at least a shallow parse. Most production systems handle a curated list of heteronyms with rule priority; the tail is long.
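As a rough illustration of a context rule for a single heteronym, the sketch below chooses a pronunciation for "read" from the words to its left. The cue list, the two-token window, and the function name are hypothetical; the pronunciations are written in ARPAbet-style symbols. A real system would rely on a part-of-speech tagger rather than a keyword list.

```python
# Hypothetical cue list: forms of "have" that take a past participle.
PAST_CUES = {"have", "has", "had", "having"}

# ARPAbet-style pronunciations for the two readings of "read".
READ_PRESENT = "R IY1 D"  # rhymes with "need"
READ_PAST = "R EH1 D"     # rhymes with "bed"


def pronounce_read(tokens, index):
    """Pick a pronunciation for tokens[index], which must be 'read'."""
    if tokens[index].lower() != "read":
        raise ValueError("token at index is not 'read'")
    # Look back two tokens so an intervening adverb is tolerated,
    # as in "have already read".
    window = tokens[max(0, index - 2):index]
    if any(tok.lower() in PAST_CUES for tok in window):
        return READ_PAST
    return READ_PRESENT


if __name__ == "__main__":
    for sentence in ("I will read this", "I have read this"):
        toks = sentence.split()
        print(sentence, "->", pronounce_read(toks, toks.index("read")))
```

The rule falls back to the present-tense reading, so it gets a simple past such as "I read it yesterday" wrong. That is exactly the kind of case that pushes systems from keyword rules towards a tagger or parser.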

Acoustic models: from phonemes to spectrograms

The acoustic model is where most of the last decade's research energy went. The breakthrough architecture was the sequence-to-sequence model with attention (Tacotron, 2017), followed by Tacotron 2 (2018), which conditions a modified WaveNet vocoder on mel-spectrogram predictions and achieves that near-human MOS of 4.53.

The acoustic model takes a sequence of phoneme (or character) tokens and outputs a mel spectrogram - a time-frequency representation where the y-axis is perceptual frequency (mel scale) and the x-axis is time in frames. A typical 22,050 Hz system uses 80 mel bins and 256-sample hop size, giving roughly 86 frames per second.
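The frame-rate figure follows directly from the sample rate and hop size, as the short calculation below shows. The mel conversion uses one common variant of the mel formula (the HTK-style 2595 * log10(1 + f/700)), which is an assumption here; libraries differ in which variant they apply. The frame count ignores padding, which also varies by implementation.

```python
import math

SAMPLE_RATE = 22050  # samples per second
HOP_SIZE = 256       # samples between successive frames
N_MELS = 80          # mel bins per frame


def frames_per_second(sample_rate=SAMPLE_RATE, hop=HOP_SIZE):
    """Number of spectrogram frames produced per second of audio."""
    return sample_rate / hop


def spectrogram_shape(duration_s, sample_rate=SAMPLE_RATE, hop=HOP_SIZE,
                      n_mels=N_MELS):
    """Approximate (frames, mel bins) for a clip, ignoring edge padding."""
    return int(duration_s * sample_rate / hop), n_mels


def hz_to_mel(freq_hz):
    """HTK-style mel conversion (one of several variants in use)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


if __name__ == "__main__":
    print(round(frames_per_second(), 2), "frames per second")
    print(round(1000 * HOP_SIZE / SAMPLE_RATE, 2), "ms per frame")
    print(spectrogram_shape(3.0), "for a 3-second clip")
```

Each frame therefore covers a little under 12 ms of audio, which is the time resolution at which the acoustic model has to place every phoneme.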

The key design decisions here are:

Attention mechanism: how the model learns which input phonemes correspond to which output frames. Hard monotonic attention works well for TTS because speech is approximately left-to-right; unconstrained attention can "skip" or "repeat" phonemes.
Duration modelling: non-autoregressive models (FastSpeech, 2019) predict explicit phoneme durations, which removes the attention failure mode. FastSpeech reported roughly 38x faster end-to-end synthesis (and about 270x faster mel-spectrogram generation) than an autoregressive Transformer TTS baseline at comparable quality.
Conditioning: speaker identity, speaking style, or emotional tag can be provided as an embedding, enabling multi-speaker and expressive TTS without separate model copies.
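The duration-modelling idea can be sketched as a length regulator: each phoneme is repeated for as many frames as its predicted duration, so the input-to-output alignment is fixed by construction rather than learned through attention. The sketch below is a simplification that repeats phoneme symbols; in an actual model the repeated items are hidden-state vectors. The durations and the `alpha` speed parameter are made-up values for illustration.

```python
def length_regulate(phonemes, durations, alpha=1.0):
    """Expand a phoneme sequence to frame level.

    phonemes:  list of phoneme symbols (stand-ins for encoder states)
    durations: predicted number of frames for each phoneme
    alpha:     hypothetical speed control; values above 1 slow speech down
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        # Every phoneme is emitted, so none can be skipped; each is
        # emitted exactly as many times as predicted, so none can loop.
        frames.extend([phoneme] * round(n_frames * alpha))
    return frames


if __name__ == "__main__":
    phonemes = ["HH", "AH", "L", "OW"]
    durations = [3, 5, 4, 9]  # illustrative values, not model output
    frames = length_regulate(phonemes, durations)
    print(len(frames), "frames:", " ".join(frames))
```

Because the expansion is a deterministic function of the durations, all output frames can be generated in parallel, which is where the speed-up over frame-by-frame autoregressive decoding comes from.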

Vocoders and neural audio codecs

The acoustic model's mel spectrogram is not audio. A vocoder converts it into a waveform. Two families matter:

Neural vocoders (WaveNet, WaveGlow, HiFi-GAN) directly synthesise samples. WaveNet is autoregressive - it conditions each sample on the samples before it - giving high quality but slow inference (slower than real time on CPU without distillation). HiFi-GAN (adversarially trained) and WaveGrad (diffusion-based) are non-autoregressive: they generate the samples of a segment in parallel, recovering near-WaveNet quality at real-time speed or faster.
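A back-of-envelope sketch of why sample-level autoregression is slow. Conventions for reporting speed differ; this sketch assumes the real-time factor is defined as compute time divided by audio duration, so that values above 1 mean slower than real time. The 22,050 Hz rate is the one used earlier in this section, and the per-step latency is a hypothetical figure, not a measurement.

```python
def real_time_factor(compute_seconds, audio_seconds):
    """Compute time divided by audio duration (above 1 = slower than real time)."""
    return compute_seconds / audio_seconds


def sequential_steps(sample_rate, audio_seconds):
    """Forward passes an autoregressive vocoder must run one after another."""
    return int(sample_rate * audio_seconds)


if __name__ == "__main__":
    sample_rate = 22050
    step_latency_s = 1e-4  # hypothetical: 0.1 ms per generated sample

    steps = sequential_steps(sample_rate, 1.0)
    compute = steps * step_latency_s
    print(steps, "sequential steps for one second of audio")
    print("real-time factor:", round(real_time_factor(compute, 1.0), 3))
```

Even at a tenth of a millisecond per sample, the steps cannot be run in parallel, so one second of audio takes more than two seconds to produce. Non-autoregressive vocoders avoid this by removing the sample-to-sample dependency.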

Neural audio codecs (SoundStream, EnCodec) take a different approach: they learn a compressed discrete representation of audio using a residual vector quantiser (RVQ), in which each quantiser stage encodes the error left over by the stages before it. SoundStream demonstrated that a fully convolutional encoder-decoder with RVQ can compress speech to 3 kbps while outperforming the Opus codec at 12 kbps. EnCodec (Meta, 2022) extended this to 24 kHz mono and 48 kHz stereo with a multi-scale STFT discriminator and a loss balancer that stabilises training.
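The residual quantisation step can be illustrated with a toy example. In the sketch below each stage picks the nearest codebook entry to whatever error the previous stages left behind, and the decoder simply sums the chosen entries. The codebooks are hand-built 3x3 grids at decreasing scales, purely for illustration; in SoundStream or EnCodec they are learned jointly with the encoder and decoder and operate on high-dimensional latent vectors.

```python
def grid(step):
    """Toy 2-D codebook: a 3x3 grid of points spaced `step` apart."""
    return [[x * step, y * step] for x in (-1, 0, 1) for y in (-1, 0, 1)]


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def rvq_encode(vec, codebooks):
    """Quantise vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


if __name__ == "__main__":
    codebooks = [grid(1.0), grid(0.3), grid(0.1)]  # coarse to fine
    vec = [0.8, -1.3]
    indices = rvq_encode(vec, codebooks)
    for n in range(1, len(codebooks) + 1):
        approx = rvq_decode(indices[:n], codebooks[:n])
        err = sum((a - v) ** 2 for a, v in zip(approx, vec))
        print(n, "stage(s): squared error", round(err, 4))
```

Decoding with only the first few indices still gives a usable, coarser reconstruction. That property is what lets a single codec serve several bitrates: dropping later stages lowers the bitrate at the cost of fidelity.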

These codecs matter for TTS because they enable a third generation of synthesis: codec language models. VALL-E (Microsoft, 2023) reframes TTS as conditional language modelling over EnCodec tokens. Given a 3-second recording of a speaker's voice plus a text prompt, VALL-E generates the corresponding EnCodec token sequence, which the codec then decodes to a waveform. This zero-shot voice cloning capability was not feasible with waveform-domain vocoders at the same quality level.

The conceptual shift is significant:

Classical pipeline:   text -> phonemes -> mel spectrogram -> waveform
Codec LM pipeline:    text + speaker tokens -> codec tokens -> waveform

The acoustic model and vocoder collapse into one conditional language model trained on quantised audio. The normalisation and G2P stages still exist upstream, though VALL-E's training on 60,000 hours of English speech gives it some implicit tolerance for orthographic variation.
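To get a feel for the sequence lengths a codec language model has to handle, the sketch below relates a codec's frame rate and codebook configuration to its bitrate and token count. The specific values (75 frames per second, 8 codebooks of 1,024 entries each) are assumptions chosen for illustration; they are believed to correspond to a 6 kbps EnCodec setting at 24 kHz, but check the codec's documentation before relying on them.

```python
import math


def codec_bitrate(frame_rate, n_codebooks, codebook_size):
    """Bits per second: one index per codebook per frame."""
    return frame_rate * n_codebooks * math.log2(codebook_size)


def token_count(duration_s, frame_rate, n_codebooks):
    """Total discrete tokens needed to represent a clip."""
    return int(duration_s * frame_rate) * n_codebooks


if __name__ == "__main__":
    # Assumed configuration, for illustration only.
    frame_rate, n_codebooks, codebook_size = 75, 8, 1024

    print(codec_bitrate(frame_rate, n_codebooks, codebook_size), "bits per second")
    print(token_count(3.0, frame_rate, n_codebooks), "tokens in a 3-second prompt")
    print(token_count(10.0, frame_rate, n_codebooks), "tokens for 10 seconds of output")
```

Under these assumptions a few seconds of speech already amounts to thousands of tokens, far more than the text that produced it, which is why codec language models pay close attention to how the codebook levels are ordered and predicted.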

Voice cloning and speaker adaptation

Zero-shot voice cloning (as in VALL-E) requires only a short reference clip and works without fine-tuning. The speaker's voice identity is carried by the reference clip itself: its codec tokens serve as an acoustic prompt that conditions the generation of the remaining tokens.

Speaker adaptation (fine-tuning an existing model on 5-30 minutes of a target speaker's audio) was the dominant approach before 2023. It gives more reliable reproduction of unusual voice characteristics - regional accent, laryngeal quality, speaking rate - because it bakes speaker-specific priors into model weights rather than relying on prompt conditioning to generalise.

The two approaches are not mutually exclusive: a zero-shot model can be fine-tuned for production deployment of a known speaker, trading flexibility for fidelity.

When it falls down

Even with near-human MOS scores at the benchmark sentence level, TTS systems fail in predictable ways:

Long sentences with complex prosody. A model trained on read speech degrades on spontaneous-style text with embedded lists, hedges, and restarts. The acoustic model's attention or duration predictor was not exposed to this distribution.
Rare proper nouns and code-switched text. "Nguyen" and "Ptolemy" expose G2P failures instantly. Mixing languages ("please open the fichier") confuses phoneme lookup unless the normaliser language-tags each segment.
Numeric and symbolic edge cases. Telephone numbers, IP addresses, mathematical expressions, and chemical formulae all require bespoke normalisation rules. A generic rule set will mispronounce 192.168.1.1 as "one hundred ninety-two point one hundred sixty-eight...", where a listener expects something like "one nine two dot one six eight dot one dot one".
Expressiveness and paralinguistics. Laughter, hesitation, emphasis, and whispering are outside the distribution of most read-speech training corpora. Models may fail to produce them at all or produce only a bland approximation.
Codec artefacts at low bitrate. When the codec runs with only a few quantiser levels, codec LM outputs can exhibit metallic ringing or muffled transients, particularly on fricatives and plosives.
Prosody in emotionally loaded text. A sentence like "he was, of course, not guilty" can carry sarcasm, genuine relief, or flat reportage. Without explicit emotion conditioning, the model picks whatever was most frequent in training.
Hallucinated phonemes. Autoregressive acoustic models and codec LMs can occasionally insert or repeat a word, particularly at sentence boundaries where the attention or language model loses coherence.
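Of the failure modes above, the numeric edge cases are the most amenable to a targeted fix. The sketch below shows a bespoke rule for IPv4-style addresses that reads each octet digit by digit and says "dot" for the separators. The function names are hypothetical, and the pattern does not check that each octet is in the range 0-255; such a rule would also need to run before any generic decimal-number rule so that the address is claimed first.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

# Four groups of one to three digits separated by dots.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def speak_ipv4(match: re.Match) -> str:
    """Read each octet digit by digit, joining octets with 'dot'."""
    octets = match.group(0).split(".")
    return " dot ".join(
        " ".join(DIGITS[int(d)] for d in octet) for octet in octets)


def normalise_ipv4(text: str) -> str:
    """Replace IPv4-looking strings with their spoken form."""
    return IPV4.sub(speak_ipv4, text)


if __name__ == "__main__":
    print(normalise_ipv4("Connect to 192.168.1.1 now"))
```

Each symbolic format (telephone numbers, version strings, formulae) needs a similar dedicated pattern, which is why normalisation rule sets grow large and why their ordering matters.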