Whisper: Weakly-Supervised ASR

Whisper trains a sequence-to-sequence Transformer on 680,000 hours of weakly-supervised internet audio to achieve robust multilingual speech recognition without task-specific fine-tuning.

intermediate · 8 min read

Six hundred and eighty thousand hours. That is the scale of audio OpenAI scraped from the internet to train Whisper, released in September 2022. For context, the LibriSpeech benchmark that dominated prior ASR research contains roughly 1,000 hours. The bet was simple: instead of curating a small, clean labelled corpus, collect a massive one that is inherently noisy, and let scale do the work that data quality used to do.

The result challenged a decade of conventional ASR wisdom about the necessity of carefully controlled training data.

What "weakly supervised" actually means

The standard supervised ASR pipeline requires paired audio and transcripts, with the transcripts written by human annotators or the audio recorded under controlled studio conditions. Every utterance is clean, every transcript is verified. This is expensive and does not scale.

Whisper's training data comes from the web: audio paired with text that was already associated with it - subtitles, captions, transcripts posted alongside video or audio files. These labels were not produced by human annotators inspecting the audio; they were produced by whoever uploaded the content, or by automatic captioning systems. Hence "weakly supervised": the supervision signal exists but is noisy, potentially misaligned, and of variable quality.

The key insight from Radford et al. (2022) is that at sufficient scale this noise averages out. The model sees enough variation in accents, recording conditions, background noise, and transcription styles that it learns robust representations rather than overfitting to any particular acoustic environment.

This differs from self-supervised approaches (such as wav2vec 2.0, arXiv:2006.11477), where the model learns representations entirely from unlabelled audio via contrastive or masked-prediction objectives and is then fine-tuned on labelled data. Whisper has no fine-tuning stage; it is trained end-to-end on the weakly-labelled pairs and evaluated zero-shot, directly on each benchmark.

Architecture: a deliberately unexciting choice

Whisper is a standard encoder-decoder Transformer, identical in structure to the models used for neural machine translation. There is no architectural novelty here, and that is deliberate.

Encoder. The input is a log-mel spectrogram computed over 30-second audio windows (80 mel bins, 25 ms windows, 10 ms hop). Two convolutional layers with GELU activations downsample the time axis before the signal enters a stack of Transformer encoder blocks. Sinusoidal positional embeddings are added after the convolutions.
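The frame arithmetic behind the encoder can be checked in a few lines. This is a minimal sketch, assuming 16 kHz input audio and a stride of 2 on the second convolution (both taken from the Whisper paper rather than from the text above); the function names and the sines-then-cosines layout of the embedding row are illustrative choices, not the reference code.

```python
import math
from typing import List, Tuple

SAMPLE_RATE = 16_000      # Hz; assumed input sample rate
WINDOW_SECONDS = 30       # fixed input window
HOP_MS = 10               # spectrogram hop length
CONV_STRIDES = (1, 2)     # two conv layers; the second is assumed to have stride 2


def encoder_positions(seconds: int = WINDOW_SECONDS) -> Tuple[int, int]:
    """Return (spectrogram frames, encoder positions) for one window."""
    hop_samples = SAMPLE_RATE * HOP_MS // 1000        # 160 samples per hop
    frames = seconds * SAMPLE_RATE // hop_samples     # 3000 frames for 30 s
    positions = frames
    for stride in CONV_STRIDES:
        # With 'same'-style padding, a strided conv divides the length by its stride.
        positions = math.ceil(positions / stride)
    return frames, positions


def sinusoidal_embedding(position: int, dim: int) -> List[float]:
    """One row of a sinusoidal position table (dim must be even and >= 4)."""
    half = dim // 2
    # Timescales form a geometric progression from 1 to 10,000.
    step = math.log(10_000) / (half - 1)
    freqs = [math.exp(-step * i) for i in range(half)]
    sines = [math.sin(position * f) for f in freqs]
    cosines = [math.cos(position * f) for f in freqs]
    return sines + cosines
```

Under these assumptions a 30-second window becomes 3,000 spectrogram frames and 1,500 encoder positions, so each position covers 20 ms of audio.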

Decoder. A standard autoregressive Transformer decoder generates tokens one at a time, attending to the encoder output via cross-attention.

Multitask conditioning via special tokens. Rather than training separate models per task, Whisper uses a sequence of special tokens prepended to the decoder input to specify what the model should do:

<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
  • Language token (<|en|>, <|fr|>, etc.) - identifies the source language, or is predicted by the model if not provided.
  • Task token (<|transcribe|> or <|translate|>) - transcription keeps the source language; translation produces English output regardless of source.
  • Timestamp token - <|notimestamps|> requests plain text; if it is omitted, the decoder interleaves time tokens between word groups.
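The conditioning prefix can be assembled mechanically. The helper below is a hypothetical illustration (build_decoder_prompt is not part of any library) that works on token strings rather than token ids; returning only the start token when no language is given is one simple way to let the model predict the language itself.

```python
from typing import List, Optional

TASKS = ("transcribe", "translate")


def build_decoder_prompt(
    language: Optional[str],
    task: str = "transcribe",
    timestamps: bool = False,
) -> List[str]:
    """Build the special-token prefix that tells the decoder what to do."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    prompt = ["<|startoftranscript|>"]
    if language is None:
        # No language given: stop here so the model can predict the language token.
        return prompt
    prompt.append(f"<|{language}|>")
    prompt.append(f"<|{task}|>")
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt
```

Calling build_decoder_prompt("en") reproduces the example prefix shown above; passing task="translate" with a French language code asks for English output from French audio.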

This single-model multitask formulation lets one checkpoint handle 99 languages, transcription, translation, language identification, and voice activity detection. The model sizes range from 39 M parameters (Tiny) to 1.55 B (Large), with a later Turbo variant at 809 M.

Training data and the 680k pipeline

The data pipeline filtered raw web audio with several heuristics:

  • Removed content where the transcript was likely machine-generated, detected with heuristics on surface features such as missing punctuation or text that is entirely uppercase or entirely lowercase.
  • Applied language detection to exclude audio whose spoken language did not match the associated transcript language.
  • Filtered out very short or very long segments.
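The three filters above might be sketched as follows. Everything concrete here is an illustrative assumption rather than the paper's pipeline: the Clip record, the specific surface checks, and the duration bounds are all hypothetical, and the language codes are assumed to come from separate audio and text language detectors.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    transcript: str
    audio_lang: str    # language detected in the audio (assumed given)
    text_lang: str     # language detected in the transcript (assumed given)
    seconds: float


def looks_machine_generated(text: str) -> bool:
    """Crude surface check; only meaningful for scripts that have letter case."""
    if not any(c.isalpha() for c in text):
        return True
    single_case = text == text.upper() or text == text.lower()
    has_punctuation = any(c in ".,?!" for c in text)
    return single_case or not has_punctuation


def keep(clip: Clip, min_seconds: float = 1.0, max_seconds: float = 30.0) -> bool:
    """Apply the three filters; the duration bounds are hypothetical."""
    if looks_machine_generated(clip.transcript):
        return False
    if clip.audio_lang != clip.text_lang:
        return False
    return min_seconds <= clip.seconds <= max_seconds
```

A clip such as Clip("Hello there, how are you?", "en", "en", 5.0) passes, while the same words in all capitals with no punctuation are rejected as probable ASR output.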

The resulting corpus spans 97 languages. English dominates (roughly 65% of total hours), but the long tail of lower-resource languages is large enough that models generalise to them without language-specific fine-tuning.

A notable consequence: the model's transcription quality correlates strongly with the hours of training data per language. Languages with fewer than ~1,000 training hours show substantially higher word error rates, particularly for character-level languages like Chinese and Japanese where the training signal relies on correct Unicode handling by the original captioners.

Decoding and timestamp generation

At inference time, Radford et al. decode with beam search using 5 beams. Because the model processes fixed 30-second windows, long audio is handled by sliding the window and using predicted timestamps to determine where the next window should begin.
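The window-sliding loop can be sketched without a real model. In this minimal sketch, decode_window is a hypothetical stand-in for the model call: it takes a start offset in seconds and returns segments whose times are relative to that offset. Falling back to a full 30-second jump when nothing is returned is an assumption made for the example.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float, str]   # (start, end, text) in seconds
WINDOW = 30.0                        # fixed window length


def transcribe_long(
    duration: float,
    decode_window: Callable[[float], List[Segment]],
) -> List[Segment]:
    """Slide a 30-second window, advancing to the last predicted timestamp."""
    results: List[Segment] = []
    seek = 0.0
    while seek < duration:
        segments = decode_window(seek)
        for start, end, text in segments:
            # Convert window-relative times to absolute times.
            results.append((seek + start, seek + end, text))
        # Resume where the last segment ended; skip a whole window if nothing
        # was decoded or the timestamp would not move us forward.
        advance = segments[-1][1] if segments else WINDOW
        seek += advance if advance > 0 else WINDOW
    return results
```

Starting the next window at the last predicted timestamp, rather than at a fixed 30-second stride, avoids cutting an utterance in half at the window boundary.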

The model can also produce timestamps. The mechanism is approximate: the decoder generates special <|t|> tokens interleaved with the text tokens, marking where each segment starts and ends, with times quantised to 20 ms steps. This is not a CTC-based monotonic alignment; it is a learned sequence-to-sequence prediction of time offsets, so it can drift on dense speech, and it gives segment-level rather than word-level timing.
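Turning a decoded token stream into timed segments is mostly bookkeeping. The sketch below assumes timestamp tokens are rendered as strings like <|1.40|> (seconds relative to the window start) and that they arrive in start/end pairs around each run of text; split_segments is a hypothetical helper, not a library function.

```python
import re
from typing import List, Optional, Tuple

# Matches tokens such as <|0.00|> or <|12.34|> (assumed rendering).
TIMESTAMP = re.compile(r"^<\|(\d+\.\d+)\|>$")


def split_segments(tokens: List[str]) -> List[Tuple[float, float, str]]:
    """Group text tokens between a start and an end timestamp token."""
    segments: List[Tuple[float, float, str]] = []
    start: Optional[float] = None
    words: List[str] = []
    for tok in tokens:
        match = TIMESTAMP.match(tok)
        if match is None:
            words.append(tok)
            continue
        t = float(match.group(1))
        if start is None:
            # Opening timestamp: begin a new segment and drop stray text.
            start, words = t, []
        else:
            # Closing timestamp: emit the segment collected so far.
            segments.append((start, t, "".join(words).strip()))
            start, words = None, []
    return segments
```

For example, the stream <|0.00|>, " Hello", " world", <|1.40|> yields a single segment from 0.0 to 1.4 seconds containing "Hello world". A segment with no closing timestamp is dropped, which is one reason the final segment of a window is often re-decoded in the next one.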

Hallucination in silence is a known decoding problem. When the input audio contains very little speech (background noise, music, long pauses), the autoregressive decoder sometimes generates plausible-sounding text anyway rather than producing the end-of-sequence token. The <|nospeech|> token addresses this partially: the probability the model assigns to it can be thresholded, but the threshold must be tuned per use case.
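One common way to apply the threshold is to combine the no-speech probability with the decoder's own confidence, so that a confidently decoded window is kept even when the no-speech probability is high. The sketch below is an illustration under that assumption; the default values 0.6 and -1.0 are starting points to tune, not recommendations, and should_skip_window is a hypothetical name.

```python
def should_skip_window(
    no_speech_prob: float,
    avg_logprob: float,
    no_speech_threshold: float = 0.6,
    logprob_threshold: float = -1.0,
) -> bool:
    """Decide whether to discard a window's text as probable hallucination."""
    if no_speech_prob <= no_speech_threshold:
        # The model thinks there is speech: keep the text.
        return False
    # High no-speech probability: skip unless the decode itself was confident.
    return avg_logprob <= logprob_threshold
```

Raising no_speech_threshold keeps more quiet speech at the cost of more fabricated text; lowering it does the opposite, which is why the value has to be tuned on representative audio.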

When it falls down

Hallucination. On near-silent segments or segments with non-speech audio (music, ambient noise), Whisper will sometimes produce fluent but entirely fabricated text. This is the most dangerous failure mode in production transcription pipelines. No beam score threshold reliably catches it in all conditions.

Low-resource language quality. Languages with fewer than a few hundred training hours show error rates far above what the English benchmarks suggest. Whisper's zero-shot WER on low-resource languages can exceed 50% even on clean audio.
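For reference, word error rate is the word-level edit distance between reference and hypothesis divided by the number of reference words, so it can exceed 100% when the hypothesis contains many insertions. A minimal sketch follows, assuming whitespace tokenisation and no text normalisation (published evaluations usually normalise text before scoring).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

A WER of 0.5 therefore means that half as many word edits as there are reference words were needed to turn the hypothesis into the reference.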

Punctuation and formatting bias. Because the training data comes from internet captions, which tend to follow broadcast English conventions, Whisper imposes English-style capitalisation and punctuation even on non-English output. For languages with different orthographic conventions this introduces systematic errors.

Real-time streaming is not native. The architecture processes fixed 30-second chunks. Streaming requires chunking with overlap and re-stitching, which introduces latency and occasional mis-joins at boundaries. Dedicated streaming ASR models (RNN-T, Conformer-based systems) are better suited when end-to-end latency below 500 ms is required.
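The chunk-with-overlap step can be sketched as a boundary calculation. The 5-second overlap is an arbitrary value chosen for illustration, and chunk_bounds is a hypothetical helper; re-stitching the overlapping transcripts is the harder part and is not shown.

```python
from typing import List, Tuple


def chunk_bounds(
    duration: float,
    chunk: float = 30.0,
    overlap: float = 5.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times of overlapping chunks covering the audio."""
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be non-negative and smaller than chunk")
    step = chunk - overlap
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        bounds.append((start, end))
        if end >= duration:
            break          # the last chunk already reaches the end
        start += step
    return bounds
```

A 70-second recording with these settings is covered by three chunks starting at 0, 25, and 50 seconds. Each overlap region is transcribed twice, and that duplicated text is where mis-joins tend to occur.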

Accents and domain shift. Despite large-scale training, Whisper degrades noticeably on strongly accented speech and specialised vocabularies (medical, legal, technical). Fine-tuning on even a small in-domain corpus recovers much of this gap, but Whisper was not designed for fine-tuning and the standard recipe requires careful learning rate scheduling to avoid catastrophic forgetting.

Timestamp precision. The interleaved timestamp tokens give segment-level alignment, not phoneme-level. For subtitle generation this is sufficient; for forced alignment in linguistic research or speaker diarisation, a dedicated aligner is still needed.
