Whisper's Multitask Decoder

A single neural network that transcribes 99 languages, translates any of them into English, identifies what language is being spoken, and predicts phrase-level timestamps - yet has no branching architecture, no task-specific output heads, and no routing logic. That sounds implausible. Whisper achieves it by treating every task as a text generation problem conditioned on a short prefix of purpose-built tokens prepended to the decoder's input.

The Core Idea: Tasks as Token Prefixes

Whisper is an encoder-decoder Transformer. The encoder converts a 30-second log-mel spectrogram into a sequence of acoustic embeddings. The decoder then autoregressively generates tokens, just like a language model - but the first few tokens it receives are not from the audio at all. They are a hand-crafted prefix that tells the model what to do.

The standard prefix sequence looks like this:

<|startoftranscript|>  <|en|>  <|transcribe|>  <|notimestamps|>

Each position carries a specific meaning:

Token	Role	Values
<|startoftranscript|>	Begins the decoder sequence; signals audio is present	Fixed
Language token	Selects the spoken language	<|en|>, <|fr|>, <|zh|> (99 options)
Task token	Selects the output behaviour	<|transcribe|> or <|translate|>
Timestamp token	Controls whether timestamps are interleaved	<|notimestamps|> or omitted

After this prefix the decoder generates the transcript normally, one token at a time, attending back to the prefix through self-attention and to the encoder output through cross-attention. The model never sees a separate loss signal for each task - during training, a single cross-entropy loss on the output sequence covers all of them simultaneously.
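The prefix layout in the table above can be sketched as a small helper. This is an illustrative sketch, not the library's tokenizer API: build_prefix is a hypothetical name, and it works on token strings rather than token ids.

```python
def build_prefix(language: str, task: str = "transcribe", timestamps: bool = False) -> list[str]:
    """Return the decoder prefix as token strings (hypothetical helper)."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    # The timestamp slot is either <|notimestamps|> or left empty.
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


print(build_prefix("en"))
print(build_prefix("fr", task="translate", timestamps=True))
```

Switching task or language changes only the tokens in this list; nothing else about the model call differs.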

This is not a new idea in principle. Prefix-conditioned multitask learning dates back at least to T5, where tasks were specified as natural-language prefixes ("translate English to German: ..."). Whisper applies the same logic to audio, replacing natural-language strings with atomic special tokens for efficiency and unambiguous conditioning.

Training Pushes Task Knowledge into the Token Embeddings

The prefix works because the embeddings of the special tokens accumulate task-specific information during training on 680,000 hours of weakly supervised audio (Radford et al., 2022). The training data is a large corpus scraped from the internet: audio paired with human-generated transcripts or subtitles, in many languages, with variable quality. Each example is labelled with its language and task type, and the corresponding prefix is prepended before the transcript.

After training, the embedding of <|translate|> encodes a strong prior toward English output tokens and toward translation-style paraphrasing. The embedding of <|transcribe|> encodes the opposite prior - fidelity to what was said in the original language. The decoder's attention layers have learned to modulate their behaviour based on which task token sits in the prefix, even though the underlying architecture is identical for both paths.

The language token does double duty. If you set it at inference time, it suppresses the model's tendency to hallucinate language switches mid-transcript. If you omit it, the model must infer the language from the acoustic signal alone, typically by reading the probabilities it assigns to the language tokens immediately after <|startoftranscript|>. It does this with reasonable accuracy but at some cost in precision.
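One way to picture language inference: run the decoder for a single step after <|startoftranscript|>, discard every score that does not belong to a language token, and normalise what is left. The sketch below does this on made-up logits; detect_language and the toy values are assumptions for illustration, not real model output or a library function.

```python
import math


def detect_language(logits: dict[str, float], language_tokens: set[str]) -> tuple[str, dict[str, float]]:
    """Pick the most likely language token from first-step decoder logits.

    `logits` maps token strings to raw scores (a toy stand-in for model output).
    """
    # Keep only language tokens; every other token is masked out.
    masked = {tok: score for tok, score in logits.items() if tok in language_tokens}
    # Softmax over the surviving tokens, shifted by the max for numerical stability.
    peak = max(masked.values())
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Toy logits: invented numbers for illustration only.
toy_logits = {"<|en|>": 2.0, "<|fr|>": 4.5, "<|zh|>": 0.5, "<|transcribe|>": 9.0, "Hello": 7.0}
language, probs = detect_language(toy_logits, {"<|en|>", "<|fr|>", "<|zh|>"})
print(language, round(probs[language], 3))
```

Note that the highest raw score here belongs to a non-language token; the masking step is what turns a general next-token distribution into a language classifier.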

Timestamp Prediction as a Parallel Output Stream

One of the less obvious design choices is how Whisper handles timestamps. Rather than running a separate alignment step (as Montreal Forced Aligner does), Whisper interleaves timestamp tokens directly in the output sequence. When timestamp prediction is enabled, the sequence looks like:

<|0.00|> Hello, <|0.48|> how are <|0.84|> you doing? <|1.32|>

Timestamp tokens span a vocabulary of roughly 1500 special tokens representing times from 0 to 30 seconds in 20 ms increments. The decoder learns to emit them at the right positions by being trained on subtitles that carry timing information. The model treats choosing a timestamp token as exactly the same softmax operation over the full vocabulary as choosing a word token; there is no additional output head.
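The arithmetic behind the timestamp vocabulary is simple enough to sketch. The helper names below are hypothetical; the only facts used are the 20 ms step and the 30-second window stated above.

```python
TIME_PRECISION = 0.02    # seconds per timestamp step (20 ms)
WINDOW_SECONDS = 30.0    # length of one audio window


def timestamp_token(index: int) -> str:
    """Format the timestamp token for a step index, e.g. 24 -> '<|0.48|>'."""
    return f"<|{index * TIME_PRECISION:.2f}|>"


def token_to_seconds(token: str) -> float:
    """Parse a token such as '<|0.48|>' back into seconds."""
    return float(token.removeprefix("<|").removesuffix("|>"))


# Number of distinct timestamp tokens when both 0.00 and 30.00 are included.
n_tokens = round(WINDOW_SECONDS / TIME_PRECISION) + 1

print(timestamp_token(24), token_to_seconds("<|1.32|>"), n_tokens)
```

Because every timestamp is a multiple of 20 ms, the model cannot express finer timing than that, however precise the audio.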

This is elegant but fragile. Timestamp accuracy degrades on fast speech, overlapping speakers, and music-heavy backgrounds because the model has to balance two competing objectives - linguistic coherence and temporal precision - with a single next-token prediction objective.

Long-Form Transcription: Sliding Window + Context Conditioning

The architecture processes exactly 30 seconds of audio per forward pass (the encoder input is always padded or truncated to 30 s). For longer audio, Whisper slides this window across the recording; the reference implementation advances the window to the last predicted timestamp rather than by a fixed overlap. The key mechanism is condition_on_prev_tokens (condition_on_previous_text in the reference implementation): the transcript from the previous window is fed back as a prefix to the next window's decoder. This keeps terminology, spelling choices, and style consistent across window boundaries.
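The "padded or truncated to 30 s" step can be sketched in plain Python. The 16 kHz sample rate is an assumption not stated in this article, and fit_to_window is a hypothetical name rather than a library function.

```python
SAMPLE_RATE = 16_000              # assumed sample rate in Hz
N_SAMPLES = 30 * SAMPLE_RATE      # samples in one 30 s window


def fit_to_window(samples: list[float], length: int = N_SAMPLES) -> list[float]:
    """Return exactly `length` samples: truncate long input, zero-pad short input."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


short_clip = fit_to_window([0.1] * 1000)
long_clip = fit_to_window([0.1] * (N_SAMPLES + 5000))
print(len(short_clip), len(long_clip))
```

A one-second clip therefore costs as much encoder compute as a full 30-second one, since the padding is processed like any other audio.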

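A simplified Python sketch of the long-form loop. The encode and generate callables stand in for the model, sot_prev stands for the <|startofprev|> token that introduces previous-window text, and the fixed stride is a simplification:

def transcribe_long(audio, encode, generate, task_prefix, sot_prev, window, max_context):
    """audio: samples; task_prefix: [SOT, lang_token, task_token]; window: samples per 30 s."""
    transcript = []
    for start in range(0, len(audio), window):   # fixed stride, no overlap, for simplicity
        segment = audio[start:start + window]
        # Previous text goes before the task prefix, introduced by the start-of-previous token.
        context = [sot_prev] + transcript[-max_context:] if transcript else []
        tokens = generate(encode(segment), prefix=context + task_prefix)
        transcript.extend(tokens)
    return transcript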

The previous-transcript conditioning is powerful but also the main source of the "repetition loop" failure mode described below.

When It Falls Down

Hallucination on silence or noise. When the 30-second window contains very little speech, the decoder is still conditioned to produce a transcript. It often generates plausible-sounding but entirely fabricated text - phrases like "Thank you for watching" are notorious. The <|nospeech|> token is supposed to suppress this: if its probability exceeds a threshold, the segment is skipped. In practice, threshold tuning is sensitive and the feature does not fully solve the problem.
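The skip rule can be sketched as a threshold check. The function name is hypothetical, the second check on average token log-probability is an assumption added for illustration, and the default values are starting points to tune rather than recommended settings.

```python
def should_skip_segment(no_speech_prob: float, avg_logprob: float,
                        no_speech_threshold: float = 0.6,
                        logprob_threshold: float = -1.0) -> bool:
    """Decide whether to drop a window as silence (illustrative thresholds)."""
    # Candidate for skipping only when the no-speech probability is high...
    skip = no_speech_prob > no_speech_threshold
    # ...but keep the window if the decoder was confident in the text it produced.
    if avg_logprob > logprob_threshold:
        skip = False
    return skip


print(should_skip_segment(no_speech_prob=0.9, avg_logprob=-1.5))
print(should_skip_segment(no_speech_prob=0.9, avg_logprob=-0.2))
```

The sensitivity mentioned above shows up here directly: a slightly lower threshold drops quiet speech, a slightly higher one lets fabricated text through.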

Repetition collapse. Once the decoder enters a repetition loop (e.g. "I see, I see, I see...") the context-conditioning mechanism actively reinforces it in subsequent windows, since the repeated text becomes the previous-window prefix. Greedy decoding is especially prone to this because it always commits to the highest-probability next token, which in a loop is the repeated phrase; Radford et al. (2022) report that beam search with 5 beams reduces repetition looping.
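One common way to stop a loop from propagating is to check how compressible the previous window's text is and to drop it as context when it looks degenerate. The sketch below uses zlib for that check; the function names and the 2.4 cut-off are assumptions for illustration and would need tuning on real transcripts.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw byte length divided by zlib-compressed length; loops compress very well."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def next_window_context(previous_text: str, max_ratio: float = 2.4) -> str:
    """Drop the previous-window text when it looks like a repetition loop."""
    if compression_ratio(previous_text) > max_ratio:
        return ""    # start the next window without conditioning
    return previous_text


looping = "I see, " * 50
normal = "The encoder converts the spectrogram into acoustic embeddings."
print(round(compression_ratio(looping), 1), round(compression_ratio(normal), 1))
```

Dropping the context costs some cross-window consistency, but it breaks the feedback path that carries the loop into the next window.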

Low-resource language degradation. Performance on languages with small training-set fractions (many African and Southeast Asian languages) is substantially worse. The language token still works, but the cross-attention weights have less useful acoustic-to-phoneme mapping to draw on. Word error rates can exceed 50% even on clean audio.

Accent and dialect bias. The model was trained predominantly on internet audio, which skews toward standard accents and broadcast-quality recordings. Non-native speakers and regional dialects suffer higher word error rates, a pattern confirmed in the official model card.

No speaker diarisation. The multitask decoder was never trained to identify who is speaking. In multi-speaker audio the transcript is a single undifferentiated stream with no speaker labels. Post-processing with a separate diarisation model (e.g. pyannote) is needed, and aligning the two outputs is non-trivial.
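A minimal sketch of the alignment step, assuming the transcript arrives as (start, end, text) segments and the diarisation model as (start, end, speaker) turns. Both data shapes and all names here are hypothetical; a real diarisation library will have its own output types.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: list of (start, end, text); turns: list of (start, end, speaker).
    """
    labelled = []
    for seg_start, seg_end, text in segments:
        best = max(turns, key=lambda t: overlap(seg_start, seg_end, t[0], t[1]), default=None)
        has_overlap = best is not None and overlap(seg_start, seg_end, best[0], best[1]) > 0
        labelled.append((best[2] if has_overlap else "unknown", text))
    return labelled


# Hypothetical data standing in for transcript segments and diarisation turns.
segments = [(0.0, 1.3, "Hello, how are you doing?"), (1.5, 2.8, "Fine, thanks.")]
turns = [(0.0, 1.4, "SPEAKER_00"), (1.4, 3.0, "SPEAKER_01")]
print(assign_speakers(segments, turns))
```

The non-trivial part is what this sketch ignores: a segment that spans a speaker change gets a single label, and timestamp drift between the two models can shift the overlap to the wrong turn.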

30-second context limit. Each forward pass sees only 30 seconds of audio, so a term introduced early in a long technical discussion and referenced much later gets no acoustic context from its first occurrence; only the limited text carried over from the previous window survives. The sliding-window approach loses global context by design.

Further Reading