Word Error Rate and ASR Evaluation

A system that transcribes "recognise speech" as "wreck a nice beach" scores 200% WER: two substitutions plus two insertions against a two-word reference. That one example captures why evaluation is not a solved problem: phonetically plausible confusions, surface-level token mismatches, and brittle normalisation decisions all collapse into a single number that can mislead as easily as it informs.

The edit-distance definition

WER is the Levenshtein distance at the word level, normalised by the number of words in the reference:

WER = (S + D + I) / N

where S = substitutions, D = deletions, I = insertions, and N = total words in the reference. The counts come from a minimum-cost alignment of hypothesis and reference, found with dynamic programming; the alignment algorithms behind diff tools are close relatives. An alternative - Character Error Rate (CER) - applies the same formula at the character level and is standard for languages without clear word boundaries (Mandarin, Japanese, Thai).

Error type	Example (ref / hyp)	Count
Substitution	"speech" / "peach"	S += 1
Deletion	"the cat" / "cat"	D += 1
Insertion	"cat" / "the cat"	I += 1

WER can exceed 100% when the hypothesis contains enough insertions. This surprises newcomers but is mathematically consistent: the denominator is the reference length N, not the hypothesis length, so nothing caps the numerator.
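The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference scorer: the function names `edit_counts`, `wer`, and `cer` are made up for this example, tokens are split on whitespace with no normalisation, and the CER variant assumes spaces are ignored (conventions differ between benchmarks).

```python
from typing import Sequence


def edit_counts(ref: Sequence[str], hyp: Sequence[str]) -> tuple[int, int, int]:
    """Return (S, D, I) for one minimum-cost alignment of hyp against ref."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total edits, S, D, I) for aligning ref[:i] with hyp[:j]
    cost = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # delete every reference token
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # insert every hypothesis token
    for i in range(1, rows):
        for j in range(1, cols):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # match: no new edit
                continue
            t, s, d, n = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare on total edits first, so this keeps a minimum-cost path.
            cost[i][j] = min(substitution, deletion, insertion)
    _, s, d, n = cost[-1][-1]
    return s, d, n


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; raises ZeroDivisionError on an empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    s, d, i = edit_counts(ref, hyp)
    return (s + d + i) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    s, d, i = edit_counts(ref, hyp)
    return (s + d + i) / len(ref)
```

With these definitions, `wer("the cat", "cat")` returns 0.5 (one deletion over two reference words), while `wer("cat", "the big cat")` returns 2.0: two insertions against a one-word reference, i.e. 200%.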

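A closely related measure, Match Error Rate (MER), divides the same error count by the total number of aligned positions (correct matches plus substitutions, deletions, and insertions), which keeps it between 0 and 1. Word Information Lost (WIL) has a precision/recall flavour: it is one minus the product of the fraction of reference words matched and the fraction of hypothesis words matched. In practice, these alternatives rarely displace WER in published results because decades of benchmarks are already denominated in it.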
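As a rough sketch of how these measures relate, the function below computes them from alignment counts, using the definitions as they are usually given (following Morris et al., 2004): MER divides the errors by all aligned positions, and WIL is one minus the product of the matched fractions of reference and hypothesis. The function name `match_measures` and the `hits` argument (the number of correctly matched words) are introduced here for illustration only.

```python
def match_measures(hits: int, subs: int, dels: int, ins: int) -> dict[str, float]:
    """Compute WER, MER, WIL, and WIP from alignment counts.

    Assumes a non-empty reference and a non-empty hypothesis.
    """
    n_ref = hits + subs + dels       # words in the reference
    n_hyp = hits + subs + ins        # words in the hypothesis
    errors = subs + dels + ins
    wip = (hits / n_ref) * (hits / n_hyp)  # matched fraction of ref times matched fraction of hyp
    return {
        "wer": errors / n_ref,
        "mer": errors / (hits + errors),   # bounded by 1, unlike WER
        "wil": 1.0 - wip,
        "wip": wip,
    }
```

For a three-word reference transcribed correctly but followed by three spurious words (`match_measures(3, 0, 0, 3)`), WER is 1.0 while MER and WIL are both 0.5, which shows how the bounded measures treat insertions more gently.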

Why normalisation dominates everything

The same audio, passed through the same model, can produce WERs that differ by 5-10 absolute percentage points depending solely on text pre-processing. This is the uncomfortable truth that Whisper's 2022 paper highlighted explicitly when reporting results on LibriSpeech and other corpora (Radford et al., arXiv:2212.04356).

Common normalisation steps and their effects:

Case folding. Lowercasing reference and hypothesis before comparison prevents penalising "NASA" vs "nasa". Almost universal in English benchmarks.

Punctuation removal. Commas, full stops, and quotation marks carry no acoustic signal in most tasks, so they are stripped. But disfluencies like "(uh)" may or may not be stripped depending on the benchmark protocol.

Number expansion/contraction. Does "five" equal "5"? Under naive string comparison, no. Whisper's EnglishTextNormalizer converts both to a canonical form; systems without this conversion look worse on financial or technical audio.

Filler word handling. LibriSpeech reads aloud clean text, so fillers are rare. Conversational corpora (Switchboard, AMI) are saturated with "um", "uh", "you know". Whether you collapse or keep fillers swings WER several points.

Compound words. German compounds are a classic trap: "Bundesverfassungsgericht" versus "Bundes verfassungs gericht" produces wildly different WERs from the same acoustic quality.

The practical rule: always report the normalisation pipeline alongside the number. A 3.5% WER on LibriSpeech test-clean with normalisation is not the same claim as 3.5% without it.
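To make the effect concrete, here is a deliberately small normaliser covering three of the steps above: case folding, punctuation removal, and number and filler handling. It is a sketch under stated assumptions, not Whisper's EnglishTextNormalizer or any benchmark's official pipeline: the filler list and the number-word table are hypothetical and far from complete.

```python
import string

# Hypothetical, incomplete lists chosen for this illustration only.
FILLERS = {"um", "uh", "erm"}
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}

# Map every punctuation mark except the apostrophe to a space,
# so contractions such as "don't" stay in one piece.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation if ch != "'"})


def normalise(text: str, drop_fillers: bool = True) -> list[str]:
    """Lowercase, strip punctuation, canonicalise small numbers, optionally drop fillers."""
    tokens = text.lower().translate(_PUNCT_TO_SPACE).split()
    tokens = [NUMBER_WORDS.get(tok, tok) for tok in tokens]
    if drop_fillers:
        tokens = [tok for tok in tokens if tok not in FILLERS]
    return tokens


reference = "Um, NASA launched five rockets."
hypothesis = "nasa launched 5 rockets"

raw_match = reference.split() == hypothesis.split()
normalised_match = normalise(reference) == normalise(hypothesis)
```

Compared as raw whitespace-split tokens, only one of the five reference tokens ("launched") matches the hypothesis; after `normalise`, the two token lists are identical. Passing `drop_fillers=False` keeps "um" in the reference and reintroduces one deletion, which is the kind of protocol choice that needs to be reported with the score.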

Standard benchmarks and their quirks

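LibriSpeech is the dominant English benchmark: roughly 960 h of training audio taken from read audiobooks (volunteer LibriVox recordings rather than studio productions), with relatively clean audio and a fairly constrained vocabulary. test-clean WERs for top models now sit below 2%, which means the benchmark is arguably saturated for well-resourced English. test-other is harder (it holds the speakers that proved more difficult to recognise, with noisier recordings and more accent variation) and a more honest test.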

CHiME-6 tests far-field microphone arrays in dinner-party conditions. State-of-the-art WERs here run above 30% even with oracle speaker segmentation, underscoring that far-field noise remains genuinely hard.

Switchboard and CallHome cover telephone conversational speech. The Fisher+Switchboard training set defined a generation of industry systems; errors cluster around fast speech and heavy reduction.

SUPERB and ML-SUPERB bundle ASR alongside speaker identification, emotion, keyword detection, and other tasks into a single benchmark, reflecting the field's move toward universal speech models.

One critical detail about LibriSpeech: the test sets come from a specific set of audiobooks. Language models trained on any text that overlaps with those books gain an unfair advantage. Some researchers explicitly control for this contamination; many do not.

Beyond WER: semantic and learned metrics

WER penalises a substitution of "car" for "automobile" exactly as heavily as any other word error, even though the two are synonymous. This matters for downstream tasks - an information retrieval system often recovers from synonymous confusions but not from homophone errors like "there/their". Several alternatives have been proposed:

SBERT-WER / semantic WER embeds reference and hypothesis words and uses soft alignment instead of exact string match. Closer in spirit to task-level utility but harder to reproduce exactly.

ChrF and BERTScore are borrowed from machine translation and sometimes applied to ASR outputs, particularly in speech translation pipelines.

Word Information Preserved (WIP) is the complement of WIL (WIP = 1 - WIL) and measures what fraction of the reference content survived the transcription.

None of these has displaced WER as the primary reporting metric in the speech community. The reason is partly inertia, partly that WER has a clean causal story: each editing operation corresponds to something a downstream system must correct.

When it falls down

Homophone-rich domains. Legal and medical dictation is saturated with pairs such as "principal/principle", "ileum/ilium", and "discrete/discreet". WER cannot distinguish a medically dangerous confusion from a stylistic one.

Code-switching and transliteration. Mixed English/Hindi speech means reference transcription conventions are themselves contested: Romanisation versus Devanagari, spacing conventions, loanword treatment. Two annotators produce different references; the "correct" WER is undefined.

Long-tail vocabulary. WER weights every word equally: a system that gets every common word right but misses the single proper noun in an utterance scores the same as one that gets the proper noun right and misses one common word. Named entity accuracy is often far more operationally important than aggregate WER.
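One simple way to surface this is to report an entity-level score next to WER. The sketch below assumes the named entities in each reference have already been annotated by hand and checks whether each one appears verbatim in the hypothesis; `entity_recall` is a hypothetical helper, and exact matching is a crude stand-in for whatever entity scoring a real evaluation would use.

```python
def entity_recall(hypothesis: str, entities: list[str]) -> float:
    """Fraction of annotated entities found verbatim (case-insensitively) in the hypothesis.

    Assumes `entities` is non-empty and each entity contains at least one word.
    """
    hyp_tokens = hypothesis.lower().split()
    found = 0
    for entity in entities:
        ent_tokens = entity.lower().split()
        width = len(ent_tokens)
        # Slide a window over the hypothesis looking for the full entity phrase.
        if any(hyp_tokens[i:i + width] == ent_tokens
               for i in range(len(hyp_tokens) - width + 1)):
            found += 1
    return found / len(entities)


# Reference: "please call doctor nakamura at the clinic tomorrow"
entities = ["Nakamura"]
hyp_a = "please call doctor nakamoto at the clinic tomorrow"  # proper noun wrong
hyp_b = "please call doctor nakamura at a clinic tomorrow"    # common word wrong

recall_a = entity_recall(hyp_a, entities)
recall_b = entity_recall(hyp_b, entities)
```

Both hypotheses contain exactly one substitution against the eight-word reference, so their WER is identical, yet `recall_a` is 0.0 and `recall_b` is 1.0.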

Speaker-independent versus speaker-adapted evaluation. Systems adapted to a test speaker (via i-vector or prompt conditioning) naturally score lower WER. Comparing adapted to unadapted numbers without flagging this is a common source of misleading claims.

Streaming versus batch. Streaming ASR produces intermediate hypotheses that are later revised. "Final" WER at end-of-utterance is very different from "real-time" WER, which counts edits visible to downstream systems mid-stream. Most published numbers report the former.

Punctuation and capitalisation as a proxy for downstream quality. For voice assistants, punctuation-free output is acceptable. For transcription services feeding editors, missing capitalisation and punctuation are serious errors, and WER computed after the usual normalisation ignores them completely.

Floor effects. When WER on a standard benchmark approaches 1-2%, differences are dominated by annotation disagreements and normalisation rather than model quality. The community would benefit from more challenging benchmarks, not just better scores on saturated ones.
