Evaluating TTS with MOS

Two TTS systems can produce speech with identical word-error rates and near-identical spectrograms, yet human listeners reliably prefer one over the other. Automatic metrics cannot yet fully capture that gap, which is why subjective listening tests, and the Mean Opinion Score (MOS) in particular, remain the de facto gold standard for TTS evaluation nearly three decades after ITU-T codified the scale.

What MOS actually measures

MOS comes from ITU-T Recommendation P.800 (1996), originally designed for telephone transmission quality. Listeners rate each utterance on a 1-to-5 scale:

Score	Label	Typical perception
5	Excellent	Completely natural; indistinguishable from a real speaker
4	Good	Slight differences; no annoyance
3	Fair	Noticeable differences; slight annoyance
2	Poor	Annoying but intelligible
1	Bad	Very annoying; difficult to understand

The reported MOS is the arithmetic mean of all listener ratings for a system. A difference of 0.1-0.2 MOS points is often treated as meaningful in the TTS literature, though statistical significance depends entirely on sample size and rater variance.
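To make the dependence on sample size concrete, here is a minimal sketch that computes a normal-approximation z statistic for the gap between two systems' MOS. The function name mos_difference_z and every number passed to it are illustrative assumptions, not values from any study. The calculation also treats ratings as independent, which real listening-test data (repeated ratings per listener and per utterance) usually violates.

```python
import math

def mos_difference_z(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """z statistic for the difference of two independent sample means."""
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_a - mean_b) / standard_error

# Same 0.15-point gap and same rating spread, different numbers of ratings.
# All values below are made up for illustration.
z_small = mos_difference_z(4.15, 0.9, 100, 4.00, 0.9, 100)
z_large = mos_difference_z(4.15, 0.9, 1000, 4.00, 0.9, 1000)

# |z| > 1.96 corresponds to p < 0.05 (two-sided) under the normal approximation.
print(f"100 ratings per system:  z = {z_small:.2f}")
print(f"1000 ratings per system: z = {z_large:.2f}")
```

With these inputs the same 0.15-point gap falls short of the 1.96 threshold at 100 ratings per system and clears it at 1000.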

Two common variants appear in TTS papers:

MOS-N (naturalness): listeners rate how natural the speech sounds, ignoring content.
MOS-I (intelligibility): less common; listeners rate how easy the speech is to understand.

Most modern papers report MOS-N, often shortened to just "MOS".

Running a credible listening test

A properly designed MOS study involves several choices that directly affect the result.

Listener pool. Crowdsourced platforms (Amazon Mechanical Turk, Prolific) give scale but introduce noise: non-native speakers, headphone variation, and inattentive workers. Lab studies with screened native speakers are more controlled but expensive and slow to scale. Neither is inherently wrong, but they are not directly comparable.

Stimulus set. Sentences should span prosodic variety (questions, statements, long and short), cover the phoneme inventory, and be drawn from held-out text the system has never seen. Reusing training sentences inflates scores. A typical test uses 50-100 utterances per system condition.
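One way to enforce the held-out and variety requirements is sketched below. The function select_stimuli, the question/statement split, and the 10-word length threshold are all assumptions made for illustration. The sketch does not check phoneme coverage, which would need a pronunciation lexicon.

```python
import random

def select_stimuli(candidates, training_sentences, n, seed=0):
    """Pick up to n held-out sentences, spread across simple prosodic buckets."""
    seen = {s.strip().lower() for s in training_sentences}
    held_out = [s for s in candidates if s.strip().lower() not in seen]

    # Bucket by sentence type and length (the 10-word threshold is arbitrary).
    buckets = {}
    for sentence in held_out:
        kind = "question" if sentence.rstrip().endswith("?") else "statement"
        length = "short" if len(sentence.split()) <= 10 else "long"
        buckets.setdefault((kind, length), []).append(sentence)

    rng = random.Random(seed)
    for group in buckets.values():
        rng.shuffle(group)

    # Round-robin across buckets so no single type dominates the test set.
    selected = []
    while len(selected) < n and any(buckets.values()):
        for group in buckets.values():
            if group and len(selected) < n:
                selected.append(group.pop())
    return selected
```

Matching on lower-cased text only catches exact reuse. Near-duplicates of training sentences would need a fuzzier comparison.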

Anchoring. Ratings are relative, not absolute. Presenting listeners with a natural reference recording (the "golden" anchor) and a clearly degraded anchor (e.g., vocoded speech) fixes both ends of the scale and reduces inter-rater variance. Without anchors, one lab's "4.2" is not comparable to another lab's "4.2".

Hidden reference and gold standard. Including the natural reference as a hidden test item lets you check whether listeners are paying attention: if a listener rates the reference far below 5, that listener's data is suspect.
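A hidden-reference check of this kind can be sketched as a per-listener filter. The function name screen_listeners, the tuple layout of the ratings, and the 4.0 cut-off are illustrative assumptions. The text above only says "far below 5", so the threshold has to be chosen per study.

```python
from statistics import mean

def screen_listeners(ratings, reference="natural_ref", min_reference_mean=4.0):
    """Split listeners by how they rated the hidden natural reference.

    ratings: iterable of (listener_id, condition, score) tuples.
    min_reference_mean: illustrative cut-off, not a standard value.
    """
    reference_scores = {}
    for listener_id, condition, score in ratings:
        if condition == reference:
            reference_scores.setdefault(listener_id, []).append(score)

    kept, flagged = set(), set()
    for listener_id, scores in reference_scores.items():
        (kept if mean(scores) >= min_reference_mean else flagged).add(listener_id)
    return kept, flagged

# Toy ratings, made up for illustration.
ratings = [
    ("l1", "natural_ref", 5), ("l1", "natural_ref", 4), ("l1", "system_A", 4),
    ("l2", "natural_ref", 2), ("l2", "natural_ref", 3), ("l2", "system_A", 5),
]
kept, flagged = screen_listeners(ratings)
print("kept:", sorted(kept), "flagged:", sorted(flagged))
```

Listeners who never heard a reference item end up in neither set, so the design should guarantee each listener at least a few reference trials.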

Randomisation and balance. Utterances should be randomised in order per listener. Each listener should hear each system on a balanced subset of stimuli to prevent system-utterance confounds.
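A Latin-square style rotation is one common way to get this balance. The sketch below is an assumption about how such an assignment could be built, not a prescribed procedure. The name balanced_playlists and the condition labels are illustrative.

```python
import random

def balanced_playlists(utterances, systems, n_listeners, seed=0):
    """Latin-square style assignment of systems to utterances.

    Listener i hears utterance j rendered by systems[(i + j) % len(systems)],
    so each listener hears every utterance once and each system on an equal
    share of utterances (when len(utterances) is a multiple of len(systems)).
    """
    rng = random.Random(seed)
    playlists = []
    for i in range(n_listeners):
        trials = [
            (utterance, systems[(i + j) % len(systems)])
            for j, utterance in enumerate(utterances)
        ]
        rng.shuffle(trials)  # independent presentation order per listener
        playlists.append(trials)
    return playlists

playlists = balanced_playlists(
    utterances=[f"utt_{k:02d}" for k in range(8)],
    systems=["system_A", "system_B", "natural_ref", "degraded_anchor"],
    n_listeners=4,
)
```

When the number of listeners is a multiple of the number of systems, every utterance is also heard under every system equally often across the pool, which is what removes the system-utterance confound.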

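A minimal test design, sketched in Python. Here get_audio and collect_rating are placeholders for the test harness's own playback and rating functions, and conditions would be something like system_A, system_B, natural_ref and degraded_anchor:

import random
from collections import defaultdict
from statistics import mean, stdev

def run_mos_test(utterances, conditions, listeners, get_audio, collect_rating, n=80):
    stimuli = random.sample(utterances, n)
    ratings = defaultdict(list)  # condition -> all ratings it received
    for listener in listeners:
        # Every (utterance, condition) pair, in a fresh random order per listener
        trials = [(u, c) for u in stimuli for c in conditions]
        random.shuffle(trials)
        for utterance, condition in trials:
            audio = get_audio(utterance, condition)  # synthesise, or load a recording for anchors
            ratings[condition].append(collect_rating(listener, audio))  # integer 1..5

    results = {}
    for condition, scores in ratings.items():
        # Normal-approximation 95% CI half-width; treats ratings as independent
        ci_95 = 1.96 * stdev(scores) / len(scores) ** 0.5
        results[condition] = (mean(scores), ci_95)
    return results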

Always report the 95% confidence interval. A table showing only the mean MOS without error bars conceals whether differences are significant.

Automatic MOS prediction

Crowdsourced listening tests cost money and take days. This has motivated neural MOS predictors that score utterances automatically.

UTMOS (Saeki et al., 2022), winner of the VoiceMOS Challenge 2022, fine-tunes a self-supervised learning (SSL) speech model on human MOS labels and achieves strong correlation with human ratings. The approach:

Extract frame-level features from a pre-trained SSL model (e.g., a model trained with masked prediction on large audio corpora).
Pool to utterance-level with an attention mechanism.
Regress to a scalar MOS score, jointly trained with auxiliary tasks (e.g., listener ID).
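The pooling and regression steps above can be illustrated with a toy forward pass. This is a sketch under stated assumptions, not the UTMOS architecture: the frame features are random stand-ins for SSL outputs, the weights are untrained, the sigmoid squashing into the 1-5 range is a design choice made here, and the auxiliary tasks are omitted. All function and variable names are hypothetical.

```python
import math
import random

def attention_pool(frames, attn_weights):
    """Collapse frame-level feature vectors into one utterance-level vector."""
    # One scalar relevance score per frame, softmax-normalised over time.
    scores = [sum(w * x for w, x in zip(attn_weights, frame)) for frame in frames]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(a * frame[d] for a, frame in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas

def predict_mos(frames, attn_weights, head_weights, head_bias):
    """Linear regression head on the pooled vector, squashed into [1, 5]."""
    pooled, _ = attention_pool(frames, attn_weights)
    logit = sum(w * x for w, x in zip(head_weights, pooled)) + head_bias
    return 1.0 + 4.0 / (1.0 + math.exp(-logit))

rng = random.Random(0)
dim, n_frames = 8, 50
# Random vectors standing in for SSL frame features.
frames = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_frames)]
attn_weights = [rng.gauss(0, 0.1) for _ in range(dim)]  # untrained
head_weights = [rng.gauss(0, 0.1) for _ in range(dim)]  # untrained
score = predict_mos(frames, attn_weights, head_weights, head_bias=0.0)
print(f"predicted MOS (untrained, so the value is meaningless): {score:.2f}")
```

In a real predictor the attention and head weights would be learned from human MOS labels, typically in a deep-learning framework rather than plain Python lists.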

UTMOS correlates well with human MOS on the BVCC dataset (Pearson r > 0.94 on the main track), but this strong in-distribution performance does not automatically carry over to out-of-distribution systems, as the SOMOS study showed.
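Correlation figures like this can be computed per utterance or per system, and the two can differ, so it helps to be explicit about which one is meant. The sketch below shows a system-level Pearson correlation, where scores are first averaged within each system. The record layout and function names are assumptions for illustration.

```python
import math
from collections import defaultdict
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def system_level_r(records):
    """records: iterable of (system_id, human_mos, predicted_mos), one per utterance."""
    human, predicted = defaultdict(list), defaultdict(list)
    for system_id, h, p in records:
        human[system_id].append(h)
        predicted[system_id].append(p)
    systems = sorted(human)
    # Average within each system first, then correlate the per-system means.
    return pearson_r(
        [mean(human[s]) for s in systems],
        [mean(predicted[s]) for s in systems],
    )
```

Averaging within systems cancels much of the per-utterance noise, which is one reason system-level correlations tend to look higher than utterance-level ones.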

The SOMOS dataset (Maniati et al., 2022) provides 20,000 MOS-labelled utterances from 200 neural TTS systems built on a single voice. The authors' evaluation found that state-of-the-art MOS predictors trained on earlier corpora (which included vocoded and concatenative speech) perform significantly worse on modern neural TTS, because the score distribution shifts: most neural TTS scores cluster between 3.5 and 4.5, a much narrower range than the one the earlier models were trained on. This is the "ceiling effect" problem.

Side-by-side and CMOS

When comparing two specific systems, the Comparative MOS (CMOS) test is more sensitive than running two independent MOS studies. Listeners hear a pair of utterances (system A and system B, counterbalanced) and rate the difference on a -3 to +3 scale:

Score	Meaning
+3	A is much better than B
0	No preference
-3	B is much better than A

A CMOS of 0.1-0.2 in favour of a new system is a meaningful improvement by most conventions. Because listeners directly compare pairs, CMOS tests control for inter-session scale drift and require fewer listeners than independent MOS studies to detect the same effect size.
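Counterbalancing means the raw scores have to be re-oriented before averaging. The sketch below assumes one particular convention, namely that each raw score rates the second clip relative to the first. That convention, the function name cmos, and the toy data are illustrative assumptions. The confidence interval uses the same normal approximation as an ordinary MOS and ignores repeated ratings by the same listener.

```python
import math
from statistics import mean, stdev

def cmos(trials):
    """Mean comparative score in favour of system A, with a 95% CI half-width.

    trials: iterable of (a_played_first, score), where score in -3..+3 is the
    listener's rating of the SECOND clip relative to the FIRST.
    """
    # Re-orient every score so that positive always means "A better than B".
    oriented = [(-score if a_first else score) for a_first, score in trials]
    half_width = 1.96 * stdev(oriented) / math.sqrt(len(oriented))
    return mean(oriented), half_width

# Toy trials, made up for illustration: half with A first, half with B first.
trials = [(True, -1), (False, 1), (True, 0), (False, 2), (True, -2), (False, 0)]
value, half_width = cmos(trials)
print(f"CMOS (A vs B) = {value:+.2f} +/- {half_width:.2f}")
```

If the interval around the mean excludes 0, the preference is unlikely to be an artefact of presentation order or chance, under the independence assumption stated above.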

When it falls down

Context effects. Chiang et al. (2023) ran MOS tests on three established TTS systems under varying conditions (different listener pools, instructions, and payment levels) and obtained at least three distinct rankings of the same systems. The ranking changed depending on who was listening and how they were recruited. This finding is uncomfortable: it means many published MOS comparisons may not be reproducible outside the original lab's setup.

Domain mismatch for neural predictors. Automatic MOS models trained on mixed corpora (concatenative + vocoder + neural) tend to overestimate the quality of degraded speech and underestimate high-quality neural TTS. Reporting predicted scores, from UTMOS or any similar model, without disclosing the training distribution of the predictor is misleading.

Naturalness is not the only axis. A voice can score 4.3 MOS-N while being entirely wrong in speaker identity (in voice cloning tasks), wrong in prosody for the emotional context (expressive TTS), or intelligible only for native speakers of the training language. MOS-N is blind to all of these.

Short utterances inflate scores. Listeners find it harder to detect prosodic problems in single short sentences. MOS tests on 10-word sentences systematically over-estimate quality relative to how the system performs on paragraph-length speech.

Publication bias. Papers rarely publish MOS results that make their system look worse. The field accumulates a positive-selection bias; MOS numbers across papers are not comparable without controlled conditions.

Crowdsourced noise. Without gold-standard quality checks embedded in the survey (questions with known correct answers), 10-30% of crowdsourced ratings may come from inattentive workers, adding substantial variance.

Further reading