Prosody and Style Control
How TTS systems encode and manipulate pitch, duration, energy, and speaking style so that synthesised speech sounds intended rather than merely intelligible.
intermediate · 8 min read
Two sentences of text can carry identical words yet arrive in the listener's brain as a sharp command or a warm invitation. That gap is prosody: the suprasegmental layer of speech covering pitch contour (F0), phoneme duration, loudness, and the pauses between words. Getting it right is what separates a voice assistant that grates after ten seconds from one you forget is synthetic.
What prosody actually is
Prosody operates above the phoneme level. A single phoneme can be stretched, squeezed, raised in pitch, or dropped; run the same sequence of phonemes through different prosodic patterns and listeners report completely different emotions, sentence structures, and speaker intentions.
The four main dimensions:
| Dimension | Acoustic correlate | Perceptual effect |
|---|---|---|
| Pitch | Fundamental frequency (F0) | Intonation, question vs. statement, emotion |
| Duration | Phoneme/syllable length | Rhythm, stress, emphasis |
| Energy | RMS amplitude | Loudness, sentence stress |
| Voice quality | Spectral tilt, jitter, shimmer | Breathiness, creakiness, age cues |
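To make the first and third rows of the table concrete, here is a minimal sketch of how frame-level energy and F0 might be measured with NumPy. The function names (`frame_signal`, `rms_energy`, `f0_autocorr`), the frame sizes, and the 60 to 400 Hz search range are assumptions chosen for illustration. The autocorrelation pitch tracker is deliberately crude; production systems typically use more robust estimators.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def rms_energy(frames):
    """Root-mean-square amplitude of each frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def f0_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Crude F0 estimate: lag of the autocorrelation peak within [fmin, fmax]."""
    frame = frame - frame.mean()
    # Keep only non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return sr / lag

sr = 16000
t = np.arange(sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 200.0 * t)  # synthetic 200 Hz tone, not real speech

frames = frame_signal(signal, frame_len=1024, hop=256)
energy = rms_energy(frames)                            # one value per frame
f0 = np.array([f0_autocorr(f, sr) for f in frames])    # one value per frame
```

On real speech the F0 track would also need a voiced/unvoiced decision, which this sketch omits.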
Early concatenative TTS handled prosody via rule-based F0 templates: lookup tables that mapped syntactic labels to target pitch points, then interpolated. These rules encoded decades of phonetic fieldwork but generalised poorly to unexpected sentence structures. Neural models learn prosody statistics from data instead, which works only as well as the data: a model trained on flat-voiced read speech reproduces that flatness.
How neural TTS models prosody
Modern neural TTS systems learn prosody implicitly or represent it explicitly.
Implicit learning. Sequence-to-sequence models like Tacotron 2 (Shen et al., 2018) learn to produce mel-spectrograms from text. Prosody emerges from the attention mechanism and the decoder's recurrent state. The problem: there is no handle to grip. You cannot tell the model to sound more surprised without retraining, because prosody is entangled with everything else in the hidden state.
Explicit duration modelling. FastSpeech (Ren et al., 2019) introduced a length regulator: a duration predictor (a small convolutional network) estimates how many mel frames each phoneme should occupy, and each encoder output is then repeated that many times before being passed to the decoder. This decouples duration from the acoustic decoder and lets you scale it at inference:
phone_encodings: [p1, p2, p3]
predicted_durations: [3, 5, 2]
expanded: [p1, p1, p1, p2, p2, p2, p2, p2, p3, p3]
frames per phoneme: p1 x 3, p2 x 5, p3 x 2
Multiply all durations by 0.8 and the voice speeds up uniformly. Lengthen only selected phonemes and you get contrastive stress. The interface is simple but powerful.
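The expansion and scaling described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not FastSpeech's actual code: the function name `length_regulate`, the `emphasis` parameter, and the choice to keep at least one frame per phoneme are all hypothetical.

```python
import numpy as np

def length_regulate(encodings, durations, speed=1.0, emphasis=None):
    """Repeat each phoneme encoding for its (scaled) number of frames.

    encodings: (n_phonemes, dim) array of encoder outputs.
    durations: (n_phonemes,) predicted frame counts.
    speed:     global multiplier on durations (values below 1 speed speech up).
    emphasis:  optional per-phoneme multipliers (hypothetical parameter).
    """
    scaled = np.asarray(durations, dtype=float) * speed
    if emphasis is not None:
        scaled = scaled * np.asarray(emphasis, dtype=float)
    # Round to whole frames; keeping at least one frame is a choice of this sketch.
    frames = np.maximum(1, np.rint(scaled)).astype(int)
    return np.repeat(encodings, frames, axis=0), frames

enc = np.eye(3)  # stand-ins for the encodings p1, p2, p3

expanded, frames = length_regulate(enc, [3, 5, 2])
faster, faster_frames = length_regulate(enc, [3, 5, 2], speed=0.8)
stressed, stressed_frames = length_regulate(enc, [3, 5, 2], emphasis=[1.0, 1.6, 1.0])
```

With `speed=0.8` the ten frames shrink to eight; with the per-phoneme multiplier only the second phoneme is lengthened, which is the contrastive-stress case.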
Variational latent spaces. VITS (Kim et al., 2021) uses a conditional variational autoencoder with normalising flows and a stochastic duration predictor. The stochasticity is the point: at inference, sampling different latent vectors produces varied rhythms from the same text, reflecting the one-to-many nature of natural speech. You can also condition the latent on a reference audio to transfer rhythm and speaking rate.
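The one-to-many idea can be illustrated with a toy sampler. To be clear about what is swapped in: VITS uses a flow-based stochastic duration predictor, whereas the sketch below simply adds Gaussian noise in log-duration space. The function name `sample_durations` and the `noise_scale` parameter are hypothetical.

```python
import numpy as np

def sample_durations(mean_log_dur, noise_scale, rng):
    """Draw integer frame counts from a Gaussian centred on the mean log-duration.

    This is a toy stand-in for a learned stochastic duration predictor.
    """
    noise = rng.standard_normal(mean_log_dur.shape) * noise_scale
    return np.maximum(1, np.rint(np.exp(mean_log_dur + noise))).astype(int)

rng = np.random.default_rng(0)
mean_log_dur = np.log(np.array([3.0, 5.0, 2.0]))  # same three phonemes as above

# With no noise, the sampler collapses to a single deterministic rhythm.
deterministic = sample_durations(mean_log_dur, 0.0, rng)

# With noise, repeated calls give different rhythms for the same text.
samples = np.stack([sample_durations(mean_log_dur, 0.5, rng) for _ in range(200)])
```

Raising `noise_scale` widens the spread of rhythms; setting it to zero recovers a deterministic duration predictor.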
Style control via reference encoders and tokens
The breakthrough for high-level style control came from Global Style Tokens (GST, Wang et al., 2018).
The architecture adds a reference encoder: a small convolutional stack followed by a recurrent layer that compresses a reference audio clip into a single fixed-length vector. This vector serves as the query for a multi-head attention over a small bank of learnable style token embeddings. The attention output is the style embedding, which is combined with the text encoder's outputs before decoding.
reference audio --> [reference encoder] --> query
query + style token bank --> [multi-head attention] --> style embedding
style embedding --> added to text encoder outputs --> Tacotron decoder
Without any style labels, the model discovers axes of variation on its own. Individual tokens often come to correspond to interpretable factors: one might track speaking rate, another breathiness, another an emotional register. You can:
- Transfer style by running a reference clip through the encoder at inference.
- Interpolate styles by blending token weights: 70% token A + 30% token B.
- Directly set weights by bypassing the reference encoder and specifying the attention distribution manually.
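The three operations in the list above can be sketched with plain NumPy. This is a simplified illustration, not the GST implementation: it uses a single attention head, skips the learned projections, and fills the token bank with random numbers where a real model would have trained embeddings. The function names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def style_from_reference(ref_embedding, token_keys, token_values):
    """Single-head attention: the reference embedding queries the token bank."""
    scores = token_keys @ ref_embedding / np.sqrt(ref_embedding.shape[0])
    weights = softmax(scores)
    return weights @ token_values, weights

def style_from_weights(weights, token_values):
    """Bypass the reference encoder and set the token weights by hand."""
    return np.asarray(weights, dtype=float) @ token_values

rng = np.random.default_rng(0)
n_tokens, dim = 4, 8
tokens = rng.standard_normal((n_tokens, dim))  # learnable embeddings in a real model
ref = rng.standard_normal(dim)                 # stand-in for the reference encoder output

# Style transfer: weights come from attention over the token bank.
style, weights = style_from_reference(ref, tokens, tokens)

# Interpolation / manual control: 70% token A + 30% token B.
blend = style_from_weights([0.7, 0.3, 0.0, 0.0], tokens)
```

Either `style` or `blend` would then be added to the text encoder outputs, so the synthesis path is the same whether the weights come from a reference clip or from the user.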
The same principle underlies more recent disentangled resynthesis work. Polyak et al. (2021) extract separate representations for content, prosody, and speaker identity, then recombine them freely. For example, swapping only the prosody representation from speaker B onto the speech content of speaker A changes rhythm and intonation without changing voice identity.
Prompt-based and instruction-based control
With the rise of large language models as TTS controllers, a new interface emerged: natural-language style prompts.
Systems like InstructTTS and PromptTTS accept descriptions such as "speak slowly and warmly" alongside the text, encode the description with a text encoder, and condition the acoustic model on the resulting embedding. The embedding is usually injected through cross-attention, and in diffusion-style models its influence can be strengthened at sampling time with classifier-free guidance. This trades precision for naturalness of interface: a user does not need to know what a Global Style Token weight means.
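For readers unfamiliar with classifier-free guidance, the core arithmetic is a single extrapolation between two model outputs. The sketch below shows only that step, with made-up numbers standing in for model predictions. It is not taken from InstructTTS or PromptTTS, and the function name is hypothetical.

```python
import numpy as np

def classifier_free_guidance(pred_cond, pred_uncond, scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

# Illustrative values only; a real system would get these from two forward passes.
pred_uncond = np.array([0.0, 1.0, 2.0])  # output with the style prompt dropped
pred_cond = np.array([1.0, 1.0, 0.0])    # output given "speak slowly and warmly"

no_guidance = classifier_free_guidance(pred_cond, pred_uncond, 1.0)
strong = classifier_free_guidance(pred_cond, pred_uncond, 3.0)
```

A scale of 1 returns the conditional prediction unchanged. Larger scales push the output further in the direction the prompt suggests, typically at some cost in naturalness.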
The limitation is that natural language descriptions of prosody are ambiguous. "Calm" means something different in a children's bedtime story versus a crisis-line operator training script. Models trained on described-style datasets inherit whatever annotation bias was in the training corpus.
When it falls down
Prosody imitation overfits to speaker identity. Reference encoders do not cleanly separate style from voice: a reference clip from a fast-talking child produces a different style embedding than the same script read fast by a deep-voiced adult, even though "fast" is the intended signal. Disentanglement is partial; pure prosody transfer without leaked identity cues remains an open problem.
Long documents accumulate monotony. Models that predict prosody phoneme-by-phoneme without discourse context flatten out over paragraphs. The first sentence of a paragraph tends to get rising intonation; subsequent sentences converge toward a neutral plateau. Humans use paragraph structure, contrast, and new-versus-given information to vary prosody over longer spans. Most TTS systems have no representation for this.
Duration predictors fail on code, numbers, and abbreviations. A duration model trained on read speech has poor calibration for strings like "API", "HTTP/2", or "12,304.7". The model has no prior about how a speaker would time the expansion "Application Programming Interface". Errors in text normalisation upstream compound into timing artefacts.
Style tokens collapse under distributional shift. The unsupervised tokens are only meaningful within the domain of the training corpus. A GST model trained on audiobooks learns audiobook-specific style axes. Applied to conversational or spontaneous speech, the token assignments become unreliable and the style interpolations produce unexpected results.
Evaluation is hard. Mean Opinion Scores measure general quality; they are poor proxies for prosodic appropriateness. A voice can score 4.3/5 on naturalness while systematically sounding indifferent during text that should be urgent. Automated prosody metrics (F0 correlation, duration RMSE) measure acoustic fidelity to a reference but not communicative effectiveness.
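The two automated metrics named above are simple to compute, which is part of why they are popular despite their limits. A minimal sketch, assuming the reference and synthesised F0 contours are already time-aligned frame by frame and that unvoiced frames are marked with 0 (both assumptions; the function names are hypothetical):

```python
import numpy as np

def f0_correlation(f0_ref, f0_syn):
    """Pearson correlation over frames that are voiced (F0 > 0) in both contours."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_syn = np.asarray(f0_syn, dtype=float)
    voiced = (f0_ref > 0) & (f0_syn > 0)
    return np.corrcoef(f0_ref[voiced], f0_syn[voiced])[0, 1]

def duration_rmse(dur_ref, dur_syn):
    """Root-mean-square error between per-phoneme durations, in frames."""
    diff = np.asarray(dur_ref, dtype=float) - np.asarray(dur_syn, dtype=float)
    return np.sqrt(np.mean(diff ** 2))

# Illustrative contours in Hz; 0 marks an unvoiced frame.
f0_ref = [0, 180, 200, 220, 0, 210]
f0_syn = [0, 170, 190, 215, 150, 205]

corr = f0_correlation(f0_ref, f0_syn)
rmse = duration_rmse([3, 5, 2], [4, 5, 4])
```

Both numbers reward matching one particular reference reading. A different but equally appropriate rendition would score poorly, which is the gap between acoustic fidelity and communicative effectiveness.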
Further reading
- FastSpeech: Fast, Robust and Controllable Text to Speech (Ren et al., 2019)
- Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (Wang et al., 2018)
- Conditional Variational Autoencoder with Adversarial Learning for End-to-End TTS / VITS (Kim et al., 2021)
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations (Polyak et al., 2021)