Text Normalisation and Phonemisation

Text normalisation converts raw written text into a speakable form, and phonemisation maps those words to phoneme sequences; together they determine what a TTS system says before any audio is generated.

intermediate · 8 min read

A TTS system is handed a string like "Dr. Smith earned £1,995 in 1995." That single sentence contains at least four distinct normalisation problems: an abbreviation with context-dependent expansion, a currency symbol that is written before the amount but spoken after it, a number that could be read either as "one thousand nine hundred ninety-five" or "nineteen ninety-five," and a trailing full stop that should not become a spoken word. Get any one wrong and the synthetic voice says something absurd. In a conventional pipeline the acoustic model never sees this raw text; it receives a clean, unambiguous sequence of phonemes. Everything between the raw input and that sequence is the front-end, and it is where many intelligibility failures in production TTS systems originate.

Text Normalisation: From Written to Spoken Form

Written language is full of constructions that have no direct spoken equivalent: numerals, currency, dates, URLs, abbreviations, acronyms, code snippets. Normalisation converts each of these into the sequence of words a human reader would actually say aloud.

The problem is fundamentally ambiguous. The string "1995" should expand differently depending on context:

| Context | Correct expansion |
| --- | --- |
| Birth year: "born in 1995" | "nineteen ninety-five" |
| Page reference: "see page 1995" | "one thousand nine hundred ninety-five" |
| Price: "costs $1995" | "one thousand nine hundred ninety-five dollars" |
| Model number: "Model 1995" | "model nineteen ninety-five" |
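As a rough illustration of the table above, the sketch below spells out the same digit string in two ways and picks between them with a toy cue. The function names and the cue list ("in", "model") are assumptions made for this example, not part of any real normaliser, and the year reading is deliberately simplified.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def cardinal(n: int) -> str:
    """Spell out 0..9999 in the style used in the table (no 'and')."""
    if n < 100:
        return two_digits(n)
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        parts.append(two_digits(rest))
    return " ".join(parts)

def year(n: int) -> str:
    """Read a four-digit year as two pairs.

    Simplification: years whose first pair ends in zero (1066, 2005)
    fall back to the cardinal reading.
    """
    high, low = divmod(n, 100)
    if not 1000 <= n <= 9999 or high % 10 == 0:
        return cardinal(n)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)

# Toy cue list (an assumption): words that suggest a pairwise reading.
PAIR_READING_CUES = {"in", "model"}

def expand_number(token: str, previous_word: str) -> str:
    """Choose a reading for a digit token from one word of left context."""
    if token.startswith("$"):
        return cardinal(int(token[1:])) + " dollars"
    if previous_word.lower() in PAIR_READING_CUES:
        return year(int(token))
    return cardinal(int(token))

print(expand_number("1995", "in"))    # nineteen ninety-five
print(expand_number("1995", "page"))  # one thousand nine hundred ninety-five
```

A single word of left context is enough for these four rows, but it breaks quickly ("in 1995 cases"), which is the ambiguity the rest of this section is about.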

Traditional systems handle this with a cascade of hand-written finite-state transducers (FSTs). A tokenisation FST segments the input into tokens; a classification FST labels each token (CARDINAL, ORDINAL, DATE, CURRENCY, etc.); a verbalisation FST expands each class into words. The Google Kestrel system, widely deployed in Android TTS, is built on exactly this architecture. FSTs are fast, interpretable, and deterministic, which makes them reliable in production, but they are expensive to author and require language-expert effort for every locale.
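The three-stage cascade can be sketched with regular expressions standing in for the transducers. This is only an approximation of the control flow: it is not how Kestrel or any FST toolkit is implemented, and the abbreviation table, class labels, and the 0..99 verbaliser are assumptions chosen to keep the example short.

```python
import re

ABBREVIATIONS = {"Dr.": "doctor", "Mr.": "mister"}
CURRENCY_UNITS = {"$": ("dollar", "dollars"), "£": ("pound", "pounds")}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Stage 1: tokenisation. Known abbreviations are matched first so that
# their trailing dot is not split off as sentence punctuation.
_abbr = "|".join(re.escape(a) for a in ABBREVIATIONS)
TOKEN_RE = re.compile(rf"{_abbr}|[$£]?\d+|[A-Za-z]+|\S")

def tokenise(text: str) -> list:
    return TOKEN_RE.findall(text)

# Stage 2: classification. Each token gets exactly one semiotic class.
def classify(token: str) -> str:
    if token in ABBREVIATIONS:
        return "ABBREVIATION"
    if re.fullmatch(r"[$£]\d+", token):
        return "CURRENCY"
    if token.isdigit():
        return "CARDINAL"
    if token.isalpha():
        return "WORD"
    return "PUNCT"

def cardinal(n: int) -> str:
    """Toy verbaliser: only 0..99 are covered."""
    if not 0 <= n <= 99:
        raise ValueError("toy verbaliser only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

# Stage 3: verbalisation. Each class has its own expansion rule.
def verbalise(token: str, label: str) -> str:
    if label == "ABBREVIATION":
        return ABBREVIATIONS[token]
    if label == "CURRENCY":
        amount = int(token[1:])
        unit = CURRENCY_UNITS[token[0]][0 if amount == 1 else 1]
        return f"{cardinal(amount)} {unit}"
    if label == "CARDINAL":
        return cardinal(int(token))
    if label == "WORD":
        return token
    return ""  # punctuation is not spoken

def normalise(text: str) -> str:
    spoken = (verbalise(t, classify(t)) for t in tokenise(text))
    return " ".join(w for w in spoken if w)

print(normalise("Dr. Smith paid $42."))  # doctor Smith paid forty-two dollars
```

Because every stage is a deterministic table or pattern, the output for a given input never changes, which is the property that makes this style attractive in production. The cost is visible too: each new class or locale means more hand-written rules.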

Neural approaches reframe normalisation as sequence-to-sequence transduction. A model reads a sentence and outputs the expanded form. Because it can attend to surrounding context, it can resolve many of the ambiguities that trip up a rule classifier. Ro et al. (2022) show that a two-stage model (a Transformer tagger followed by a seq2seq verbaliser with a fine-tuned BERT encoder) outperforms both pure RNN baselines and single-stage seq2seq models on English normalisation benchmarks. The key insight is that sentence-level context encoding, rather than token-level classification alone, is what resolves the hard cases.

A practical production pipeline typically combines both: FST rules handle the high-frequency, unambiguous cases (simple cardinals, obvious abbreviations) where a neural model adds latency for no gain, while the neural model handles the long tail of unusual and context-sensitive tokens.
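A minimal sketch of that division of labour follows. The rule table and the `neural_fallback` callable are hypothetical; in a real system the fallback would be a trained model, whereas here it is any function that takes the token list and an index.

```python
import re
from typing import Callable, List, Optional, Sequence

# Hypothetical table of unambiguous, high-frequency expansions.
SIMPLE_RULES = {"&": "and", "%": "percent", "etc.": "et cetera"}

def rule_normalise(token: str) -> Optional[str]:
    """Return an expansion for unambiguous tokens, or None to defer."""
    if token in SIMPLE_RULES:
        return SIMPLE_RULES[token]
    if re.fullmatch(r"[A-Za-z']+", token):
        return token  # ordinary words pass through unchanged
    return None  # anything else is left for the context-aware model

def hybrid_normalise(
    tokens: Sequence[str],
    neural_fallback: Callable[[Sequence[str], int], str],
) -> List[str]:
    """Apply cheap rules first; call the model only where rules abstain."""
    out = []
    for i, token in enumerate(tokens):
        spoken = rule_normalise(token)
        if spoken is None:
            # The fallback sees the whole sentence so it can use context.
            spoken = neural_fallback(tokens, i)
        out.append(spoken)
    return out

def stub_model(tokens: Sequence[str], i: int) -> str:
    """Stand-in for a neural normaliser; it only marks deferred tokens."""
    return f"<model:{tokens[i]}>"

print(hybrid_normalise(["born", "in", "1995", "&", "raised"], stub_model))
```

In this arrangement the model is invoked once for the five tokens, so the latency cost is paid only on the tokens that need context.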

Phonemisation: From Words to Phoneme Sequences

Once normalisation produces a clean word sequence, phonemisation maps each word to its pronunciation, expressed as a sequence of phonemes. English pronunciations are typically written in ARPAbet (39 phonemes, ASCII-friendly, used in the CMU Pronouncing Dictionary) or IPA (International Phonetic Alphabet, preferred for multilingual systems).

The dominant approaches are:

Dictionary lookup. A curated pronunciation lexicon (the CMU Pronouncing Dictionary for English; CELEX for Dutch, English, and German) is queried first. Lookup is fast and accurate for in-vocabulary words, but any out-of-vocabulary (OOV) token (proper nouns, neologisms, brand names, code-switched words) requires a fallback.

Grapheme-to-phoneme (G2P) models. A learned model maps character sequences to phoneme sequences. This is the standard OOV fallback and, in end-to-end neural TTS, increasingly the primary path. Sequence-to-sequence architectures with attention were the dominant approach from roughly 2016 to 2021. More recent work replaces attention with CTC-based models that are significantly cheaper. Wang et al. (2023) demonstrate that combining expert phonological rules with a CTC network (LiteG2P) achieves accuracy comparable to full Transformer G2P at 33x lower computational cost, making on-device TTS practical.

Character-level or subword (byte-pair-encoded) end-to-end input. Some modern TTS systems (e.g., Tacotron variants trained directly on characters) sidestep explicit phonemisation entirely by learning the grapheme-to-acoustics mapping jointly. Wang et al. (2017) showed Tacotron could synthesise intelligible speech from character input without a pronunciation dictionary, though ambiguous pronunciations (heteronyms like "read" or "live") remain a persistent weakness without explicit disambiguation.
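The lexicon-first, G2P-second ordering described above can be sketched as follows. The two lexicon entries follow CMU Pronouncing Dictionary conventions; `placeholder_g2p` is a hypothetical stand-in that only marks where a trained model would be called, and it does not produce real phonemes.

```python
from typing import Callable, Dict, List, Tuple

# Tiny in-memory lexicon in ARPAbet, keyed by lower-cased word.
LEXICON: Dict[str, List[str]] = {
    "speech": ["S", "P", "IY1", "CH"],
    "text": ["T", "EH1", "K", "S", "T"],
}

def placeholder_g2p(word: str) -> List[str]:
    """Stand-in for a learned G2P model: one marker per grapheme."""
    return [f"<{ch}>" for ch in word]

def phonemise(
    word: str,
    lexicon: Dict[str, List[str]],
    g2p: Callable[[str], List[str]],
) -> Tuple[List[str], str]:
    """Return (phonemes, source), where source records which path was used."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key], "lexicon"
    # Out-of-vocabulary: fall back to the model.
    return g2p(key), "g2p"

print(phonemise("Speech", LEXICON, placeholder_g2p))  # lexicon path
print(phonemise("zorb", LEXICON, placeholder_g2p))    # fallback path
```

Returning the source alongside the phonemes is a small design choice that helps in practice: logging how often the fallback fires shows which OOV words are worth adding to the lexicon.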

The output of phonemisation may include stress markers (primary stress, secondary stress), which inform prosody prediction downstream. ARPAbet encodes stress as a digit suffix on vowels: AE1 (primary), AE2 (secondary), AE0 (unstressed). This stress information is critical: "PREsent" (noun) and "preSENT" (verb) are spelled identically but stressed on different syllables, with a correspondingly reduced vowel in the unstressed one, and a system that assigns the wrong stress produces the wrong word or unnatural prosody.
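The digit-suffix convention is easy to read off mechanically. The sketch below splits ARPAbet symbols into phone and stress level and finds the syllable carrying primary stress. The helper names are assumptions, and the two pronunciations of "present" are written in the style of the CMU Pronouncing Dictionary.

```python
import re
from typing import List, Optional, Tuple

STRESS_RE = re.compile(r"^([A-Z]+)([012])?$")
STRESS_NAMES = {"1": "primary", "2": "secondary", "0": "unstressed"}

def split_stress(symbol: str) -> Tuple[str, Optional[str]]:
    """Split 'AE1' into ('AE', 'primary'); consonants get stress None."""
    match = STRESS_RE.match(symbol)
    if match is None:
        raise ValueError(f"not an ARPAbet symbol: {symbol!r}")
    phone, digit = match.groups()
    return phone, STRESS_NAMES.get(digit)

def primary_stress_syllable(pronunciation: List[str]) -> Optional[int]:
    """Zero-based index of the syllable (vowel) with primary stress."""
    syllable = 0
    for symbol in pronunciation:
        _, stress = split_stress(symbol)
        if stress is None:
            continue  # consonants carry no stress digit
        if stress == "primary":
            return syllable
        syllable += 1
    return None

present_noun = ["P", "R", "EH1", "Z", "AH0", "N", "T"]
present_verb = ["P", "R", "IY0", "Z", "EH1", "N", "T"]
print(primary_stress_syllable(present_noun))  # 0
print(primary_stress_syllable(present_verb))  # 1
```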

Heteronyms and Polyphones: The Hard Disambiguation Problem

Heteronyms are words spelled identically but pronounced differently depending on syntactic role or meaning. English examples:

  • "close" (adjective, rhymes with "dose") vs "close" (verb, rhymes with "toes")
  • "wind" (noun, rhymes with "pinned") vs "wind" (verb, rhymes with "find")
  • "lead" (metal, rhymes with "bed") vs "lead" (verb, rhymes with "bead")

A pure lookup against a single-pronunciation dictionary fails on these. The correct pronunciation requires at minimum a part-of-speech tag, and in ambiguous cases, full sentence-level context. Mandarin Chinese has a similar phenomenon called polyphones (多音字), where a single character has multiple readings; this is an active research problem for Chinese TTS, where the number of ambiguous characters in a single sentence can be substantial.

Disambiguation is typically handled by a separate classifier that predicts the pronunciation variant given POS tags and context embeddings. This adds a dependency on a POS tagger, which itself has accuracy limits on informal or domain-specific text.
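In its simplest form, such a classifier reduces to a lookup keyed on word and part of speech. The sketch below assumes POS tags arrive from an upstream tagger (not shown) and uses a hypothetical coarse tag set; the pronunciations are ARPAbet for the three examples listed earlier.

```python
from typing import Dict, List, Optional

# Pronunciation variants keyed by coarse POS tag (tag names are assumptions).
HETERONYMS: Dict[str, Dict[str, List[str]]] = {
    "lead": {"NOUN": ["L", "EH1", "D"], "VERB": ["L", "IY1", "D"]},
    "wind": {"NOUN": ["W", "IH1", "N", "D"], "VERB": ["W", "AY1", "N", "D"]},
    "close": {"ADJ": ["K", "L", "OW1", "S"], "VERB": ["K", "L", "OW1", "Z"]},
}

def pronounce_heteronym(word: str, pos: str) -> Optional[List[str]]:
    """Pick a variant by POS; return None if the word is not a heteronym."""
    variants = HETERONYMS.get(word.lower())
    if variants is None:
        return None
    if pos in variants:
        return variants[pos]
    # Unknown tag: fall back to the first listed variant.
    return next(iter(variants.values()))

print(pronounce_heteronym("lead", "VERB"))  # ['L', 'IY1', 'D']
print(pronounce_heteronym("close", "ADJ"))  # ['K', 'L', 'OW1', 'S']
```

The table also shows where POS alone runs out: "lead" as a noun can be the metal or a dog's lead, which share a tag but not a pronunciation, so a real system needs the sentence-level context mentioned above.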

When It Falls Down

Numbers in mixed-language text. A normalisation model trained on English behaves unpredictably when the surrounding text is French or German, even if the digit string is the same. Multilingual normalisation requires either language-conditioned models or separate per-language FST grammars.

Domain-specific abbreviations. "Dr." expands to "doctor" in general text but to "drive" in address parsing and "debtor" in legal documents. A single abbreviation classifier that ignores domain will misfire. Production systems often maintain domain-specific overrides, which become a maintenance burden.
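One way such overrides are often organised is a general table with per-domain layers consulted first. The sketch below is an assumption about structure rather than a description of any particular system; the "St." entries and the domain names are added purely for illustration.

```python
from typing import Dict, Optional

# General-purpose expansions used when no domain override applies.
GENERAL: Dict[str, str] = {"Dr.": "doctor", "St.": "saint"}

# Per-domain layers; each only lists the entries that differ.
DOMAIN_OVERRIDES: Dict[str, Dict[str, str]] = {
    "address": {"Dr.": "drive", "St.": "street"},
    "legal": {"Dr.": "debtor"},
}

def expand_abbreviation(token: str, domain: Optional[str] = None) -> str:
    """Look up the domain layer first, then the general table."""
    overrides = DOMAIN_OVERRIDES.get(domain, {})
    # Unknown tokens are returned unchanged.
    return overrides.get(token, GENERAL.get(token, token))

print(expand_abbreviation("Dr."))             # doctor
print(expand_abbreviation("Dr.", "address"))  # drive
print(expand_abbreviation("Dr.", "legal"))    # debtor
```

The maintenance burden is visible even at this size: every new domain adds a table that has to be kept consistent with the general one, and something upstream still has to decide which domain applies.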

Cascading errors. Normalisation and phonemisation are typically sequential. An error in normalisation (wrong token classification) propagates into phonemisation and then into acoustic modelling, with no mechanism for the later stage to correct an upstream mistake. End-to-end systems that learn the full mapping jointly can, in principle, avoid this, but interpretability suffers.

OOV proper nouns. A G2P model trained on common English words systematically mispronounces foreign names, brand names, and technical acronyms. "NVIDIA" is rendered correctly by most systems; "Čerenkov" or "Nguyen" often are not. There is no clean solution: some systems maintain a curated exceptions dictionary, which scales poorly.

Prosodic word boundaries. Normalisation produces a flat word sequence, but spoken language groups words into prosodic phrases with pauses and boundary tones. Inserting prosodic structure (phrasing, emphasis) from text alone requires syntactic parsing and pragmatic reasoning; most front-ends handle this inadequately, which is why TTS systems can sound grammatically correct but prosodically robotic on long, complex sentences.
