The ASR Problem and Pipeline
Automatic speech recognition converts a raw audio waveform into a word sequence by solving an alignment problem that classical NLP never had to face.
Every second of telephone-quality audio contains 8,000 amplitude samples. A ten-word sentence takes roughly three seconds. The job of an ASR system is to map those 24,000 numbers onto the right sequence of words - without knowing in advance which audio frames correspond to which phoneme, or even how many phonemes are present. That alignment uncertainty is what makes speech recognition structurally different from text classification or machine translation, and every architectural choice in the pipeline exists to manage it.
The Raw Signal and Why You Cannot Feed It Directly
A microphone produces a time-domain waveform: pressure amplitude over time. Feeding raw samples into a sequence model is theoretically possible but wasteful. Adjacent samples are correlated at the scale of milliseconds; the discriminative information (which vowel is being spoken, which consonant boundary just occurred) lives in frequency patterns over 20-30 ms windows.
The standard preprocessing step converts those windows into a log-mel spectrogram.
- Divide the waveform into overlapping frames (e.g. 25 ms window, 10 ms hop).
- Apply a Fast Fourier Transform to each frame to get a power spectrum.
- Map the power spectrum onto a mel-scale filterbank (typically 80 filters). The mel scale approximates human auditory frequency resolution: finer bins at low frequencies, coarser at high.
- Take the log of the filterbank energies.
The result is a 2-D array of shape (T, 80), where T is the number of frames. This is the acoustic feature matrix that every downstream model receives. Some systems go one step further and compute MFCC coefficients (discrete cosine transform of the log-mel energies), but most modern neural approaches prefer log-mel directly because it preserves more spectral detail.
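waveform → framing → FFT → mel filterbank → log
shape: (N,) (T, W) (T, 257) (T, 80) (T, 80)
(W is the number of samples per window; 257 is the number of one-sided bins from a 512-point FFT.)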
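The four preprocessing steps can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the DFT is a naive O(n^2) loop where a real system would use an FFT library, the mel formula is the common HTK-style variant, and the demo uses 20 filters and a 256-point transform (both assumptions) so that it runs quickly on 8 kHz audio.

```python
import cmath
import math

def hz_to_mel(hz):
    # HTK-style mel formula; other variants exist.
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def frame_signal(samples, win, hop):
    # Overlapping frames; trailing samples that do not fill a window are dropped.
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]

def power_spectrum(frame, n_fft):
    # Hann window, zero-pad to n_fft, then a naive DFT over the one-sided bins.
    win = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (win - 1)))
                for n, s in enumerate(frame)]
    padded = windowed + [0.0] * (n_fft - win)
    bins = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(padded))
        bins.append(abs(acc) ** 2)
    return bins

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose centres are equally spaced on the mel scale.
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2)
    hz_points = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank

def log_mel(samples, sample_rate=8000, win=200, hop=80, n_fft=256, n_mels=20):
    # win=200 and hop=80 correspond to 25 ms and 10 ms at 8 kHz.
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    feats = []
    for frame in frame_signal(samples, win, hop):
        spec = power_spectrum(frame, n_fft)
        # Small constant avoids log(0) on empty or silent filters.
        feats.append([math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                      for row in bank])
    return feats

tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(800)]  # 0.1 s at 8 kHz
feats = log_mel(tone)
print(len(feats), len(feats[0]))  # prints 8 20, i.e. (T, n_mels)
```

In practice a library such as librosa or torchaudio would do this work, and the filter count, FFT size, and normalisation must match whatever the acoustic model was trained with.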
The Alignment Problem and How CTC Solves It
Once you have a feature matrix of shape (T, 80), you need to produce a token sequence of length S, where S << T. For a three-second utterance you might have T=300 frames and S=12 characters. The model cannot know which frame boundaries correspond to which character.
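The relationship between utterance length and T is simple arithmetic. The sketch below assumes no padding at the utterance edges (libraries differ on this) and uses a hypothetical helper name.

```python
def num_frames(num_samples, sample_rate, win_ms=25, hop_ms=10):
    # Number of full windows that fit when no edge padding is applied.
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop

# Three seconds of 8 kHz audio: 24,000 samples, 200-sample window, 80-sample hop.
print(num_frames(24000, 8000))  # 298, i.e. roughly 300 frames
```

With a 10 ms hop, the frame count is always close to 100 per second of audio, which is where the T=300 figure for a three-second utterance comes from.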
Connectionist Temporal Classification (CTC), introduced by Graves et al. in 2006, handles this by introducing a special blank token and marginalising over all valid alignments. The model predicts, at each frame, a distribution over the vocabulary plus blank. The loss sums the probability of all frame-level label sequences that collapse (by removing blanks and repeated tokens) to the target string.
Formally, if y is the target sequence and pi is any valid frame-level path:
P(y | x) = sum over all pi such that B(pi) = y of product_t P(pi_t | x)
where B is the collapse function. This marginalisation is computed efficiently with a forward-backward dynamic programme over the lattice of paths.
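To make the marginalisation concrete, here is a small sketch that computes P(y | x) two ways: by brute-force enumeration of every path, and with the forward half of the dynamic programme. It works with raw probabilities for readability; real implementations work in log space to avoid underflow. The function names and the toy probability matrix are illustrative assumptions.

```python
import itertools

BLANK = 0

def collapse(path):
    # B(pi): merge repeated labels, then drop blanks.
    out = []
    prev = None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

def ctc_prob_bruteforce(probs, target):
    # Sum product_t P(pi_t | x) over every path with B(pi) = target.
    # Exponential in T; only usable on toy inputs.
    T, V = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path) == list(target):
            p = 1.0
            for t, label in enumerate(path):
                p *= probs[t][label]
            total += p
    return total

def ctc_prob_forward(probs, target):
    # Forward pass over the blank-extended target: blank, y1, blank, y2, ..., blank.
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s, label in enumerate(ext):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank between two labels is only allowed when they differ.
            if s >= 2 and label != BLANK and label != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][label]
        alpha = new
    # Valid paths end on the last label or the trailing blank.
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)

# Toy input: T=4 frames, vocabulary {blank, 1, 2}; each row sums to 1.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.1, 0.4]]
print(ctc_prob_bruteforce(probs, [1, 2]), ctc_prob_forward(probs, [1, 2]))
```

The two numbers agree, but the enumeration visits |vocab|^T paths while the forward pass costs O(T x S).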
CTC has one hard constraint: it assumes conditional independence between output tokens given the input. Each frame's output depends only on the encoder hidden state at that frame, not on previously emitted tokens. This makes greedy decoding cheap (argmax at every frame, then collapse) but limits the model's ability to learn output-side language patterns. A language model rescorer applied post-hoc partially compensates.
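Greedy decoding as described (argmax per frame, then collapse) fits in a few lines. The vocabulary and frame probabilities below are made up for illustration, and index 0 is assumed to be the blank.

```python
def ctc_greedy_decode(probs, blank=0):
    # Pick the most likely label at every frame independently.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    # Collapse: merge repeats, then drop blanks.
    decoded = []
    prev = None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

vocab = ["_", "c", "a", "t"]  # "_" stands for the blank
argmax_per_frame = [1, 1, 0, 2, 2, 0, 3]
# Build a toy (T, |vocab|) matrix with 0.7 on the chosen label, 0.1 elsewhere.
probs = [[0.7 if i == k else 0.1 for i in range(len(vocab))] for k in argmax_per_frame]

print("".join(vocab[i] for i in ctc_greedy_decode(probs)))  # cat
```

Greedy decoding returns the collapse of the single best path, which is not necessarily the most probable output string; beam search over prefixes narrows that gap.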
RNN-T: Adding Output Dependency
The RNN Transducer (Graves, 2012) removes CTC's conditional-independence limitation by adding a prediction network: a small recurrent network that reads the previously emitted non-blank tokens and produces a context vector. A joiner (or joint network) combines the encoder hidden state at time t with the prediction context to produce the output distribution.
The key difference from CTC: the prediction network gives RNN-T an implicit internal language model. It can represent the probability of a noun following "the" without an external LM. In practice this tends to make RNN-T more accurate than CTC when neither uses an external LM, although words that are rare in the paired training data, proper nouns in particular, remain difficult.
The alignment algorithm changes accordingly. Instead of a fixed T x |vocab| matrix, RNN-T defines a lattice of size T x S, traversed by "emit a token" or "advance one frame" steps. The forward-backward algorithm still applies, but over this 2-D lattice.
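A sketch of the forward pass over that lattice follows. It assumes the joiner has already produced two tables of log-probabilities, here called blank_lp[t][u] (advance one frame) and emit_lp[t][u] (emit the next target token). Both names are hypothetical. The lattice has S+1 positions along the token axis because u counts how many tokens have been emitted so far, from 0 to S.

```python
import math

def logsumexp(a, b):
    # Numerically stable log(exp(a) + exp(b)), tolerating -inf.
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rnnt_log_likelihood(blank_lp, emit_lp):
    # blank_lp: T x (S+1) log-probs of "advance one frame" at node (t, u).
    # emit_lp:  T x S     log-probs of emitting target token u+1 at node (t, u).
    T = len(blank_lp)
    S = len(blank_lp[0]) - 1
    alpha = [[-math.inf] * (S + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(S + 1):
            if t == 0 and u == 0:
                continue
            from_blank = alpha[t - 1][u] + blank_lp[t - 1][u] if t > 0 else -math.inf
            from_emit = alpha[t][u - 1] + emit_lp[t][u - 1] if u > 0 else -math.inf
            alpha[t][u] = logsumexp(from_blank, from_emit)
    # A final blank leaves the last frame after all S tokens are emitted.
    return alpha[T - 1][S] + blank_lp[T - 1][S]

# Toy lattice: T=3 frames, S=2 tokens, every transition has probability 0.5.
half = math.log(0.5)
ll = rnnt_log_likelihood([[half] * 3 for _ in range(3)], [[half] * 2 for _ in range(3)])
print(math.exp(ll))  # 6 paths of 5 steps each: 6 * 0.5**5 = 0.1875
```

The training loss is the negative of this quantity; the backward pass mirrors it from the opposite corner of the lattice.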
RNN-T is the dominant architecture in production streaming ASR (Google's voice search, Amazon Alexa) because it naturally supports frame-by-frame emission: you never need a full-utterance buffer to decode.
Attention-Based Encoder-Decoder and the Conformer
An alternative to CTC/RNN-T is the attention-based sequence-to-sequence approach: encode the full input sequence, then autoregressively decode the output token by token using cross-attention to the encoder states. This is the architecture of Listen, Attend and Spell (Chan et al., 2016) and later transformer-based ASR.
Attention models are typically more accurate than CTC alone because the decoder has full access to the encoded input and learns rich output-side context. The trade-off is latency: full-sequence encoding requires completing the utterance before decoding begins (though chunk-based streaming extensions exist).
The Conformer (Gulati et al., 2020) is a widely used encoder for CTC, transducer, and attention-based ASR. It replaces the standard transformer block with a structure that stacks, in order:
- A feed-forward module whose output is scaled by 0.5 before the residual addition (a "half-step" feed-forward)
- A multi-head self-attention module
- A convolution module (depthwise separable, kernel size 31 or 15)
- A second half-step feed-forward module
The motivation is that self-attention captures long-range dependencies (across hundreds of frames) while the convolution captures local spectral patterns (adjacent frames, formant transitions). On LibriSpeech, the large Conformer model reports 2.1% WER on test-clean without a language model, a figure often described as roughly human-level on that benchmark.
Conformer block:
x -> FFN (0.5x) -> MHSA -> Conv -> FFN (0.5x) -> LayerNorm -> output
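The wiring of the block can be sketched with plain functions. This shows only the residual structure: the four sub-modules are passed in as callables (hypothetical stand-ins here), and the real modules have learned weights, their own internal normalisation, and dropout, none of which is modelled.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalise one feature vector to zero mean and unit variance (no learned gain/bias).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(x, y, scale=1.0):
    # Residual connection: x + scale * y.
    return [a + scale * b for a, b in zip(x, y)]

def conformer_block(frames, ffn1, mhsa, conv, ffn2):
    # frames: list of T feature vectors.
    # ffn1/ffn2 act on one frame; mhsa and conv take and return the whole sequence.
    frames = [add(x, ffn1(x), 0.5) for x in frames]              # half-step FFN
    frames = [add(x, y) for x, y in zip(frames, mhsa(frames))]   # self-attention
    frames = [add(x, y) for x, y in zip(frames, conv(frames))]   # convolution
    frames = [add(x, ffn2(x), 0.5) for x in frames]              # half-step FFN
    return [layer_norm(x) for x in frames]

# Stand-in modules, chosen only so the example runs.
def toy_ffn(x):
    return [2.0 * v for v in x]

def toy_mhsa(frames):
    # "Global" mixing: every frame receives the mean over time.
    mean = [sum(col) / len(frames) for col in zip(*frames)]
    return [mean[:] for _ in frames]

def toy_conv(frames):
    # "Local" mixing: each frame receives its left neighbour (zeros at the start).
    return [[0.0] * len(frames[0])] + [f[:] for f in frames[:-1]]

out = conformer_block([[1.0, 2.0, 3.0], [0.0, 1.0, 4.0]], toy_ffn, toy_mhsa, toy_conv, toy_ffn)
print(len(out), len(out[0]))  # shape is preserved: 2 3
```

The two stand-ins mirror the division of labour described above: one mixes information across the whole sequence, the other only between neighbouring frames.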
Whisper: Scaling Weak Supervision
All the architectures above assume carefully transcribed training data. Whisper (Radford et al., 2022) takes a different route: train a standard encoder-decoder transformer on 680,000 hours of audio scraped from the internet, paired with transcripts of highly variable quality, across 99 languages.
The model learns from the breadth rather than the depth. Because the training set is so large and diverse, Whisper generalises to accents, background noise, and domains that would require explicit adaptation in a supervised system. It does not beat state-of-the-art supervised models on clean read speech (LibriSpeech test-clean), but on real-world noisy audio - podcast speech, medical dictation, accented English - it is often more robust.
Whisper uses a straightforward architecture: a log-mel spectrogram (80 bins, 25 ms window, 10 ms hop) fed into a convolutional stem, then a transformer encoder, then a transformer decoder with cross-attention. The decoder is prompted with special tokens that specify the task (transcription vs. translation), language, and whether to include timestamps. This prompt-based multitask training means a single model handles multilingual ASR and speech translation.
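The prompt can be pictured as a short list of special tokens placed before the transcript. The sketch below uses the token spellings from the open-source Whisper release as best recalled; the helper function itself is hypothetical, and a real tokenizer maps these strings to integer IDs.

```python
def build_decoder_prompt(language="en", task="transcribe", timestamps=False):
    # Illustrative only: assembles the special-token prefix as strings.
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

print(build_decoder_prompt())
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']
print(build_decoder_prompt(language="fr", task="translate", timestamps=True))
# ['<|startoftranscript|>', '<|fr|>', '<|translate|>']
```

Switching from French transcription to French-to-English translation changes one token in the prefix, not the model.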
When It Falls Down
Spontaneous speech and disfluencies. Read speech (audiobooks, news) is structurally very different from conversational speech. Filled pauses ("um", "uh"), false starts, overlapping speakers, and heavy reduction (unstressed vowels collapsing to schwa) all degrade performance sharply. Models trained only on audiobooks can see WER rise to 40-60% on casual conversation.
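Word error rate, the metric quoted throughout, is the word-level edit distance between hypothesis and reference divided by the reference length. A minimal sketch, assuming whitespace tokenisation and no text normalisation (real evaluations usually normalise case and punctuation first):

```python
def word_error_rate(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
print(word_error_rate("the cat", "um the cat"))                         # 1 insertion / 2 words = 0.5
```

Because insertions count against the hypothesis, a verbatim "um" that the reference transcript omits is scored as an error, and WER can exceed 100%.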
Rare and out-of-vocabulary words. CTC and RNN-T predict from a fixed vocabulary or character set. Proper nouns (names of people, places, products) that appear rarely in training are frequently mis-transcribed. Contextual biasing (injecting a hotword list at inference) mitigates this but adds engineering complexity.
Streaming vs. accuracy trade-off. Full-sequence attention models are more accurate but cannot stream. RNN-T can stream but the fixed prediction network limits context. Systems that need both low latency and high accuracy (live captioning, voice assistants) must compromise, typically using chunk-based encoding with a lookahead buffer.
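The chunk-with-lookahead compromise comes down to index bookkeeping: each chunk is emitted on its own, but the encoder is allowed to see a few frames beyond it. A sketch with hypothetical names:

```python
def chunk_with_lookahead(num_frames, chunk, lookahead):
    # Returns (start, end, visible_end) per chunk:
    # outputs are produced for frames [start, end),
    # while the encoder may attend to frames up to visible_end.
    spans = []
    for start in range(0, num_frames, chunk):
        end = min(start + chunk, num_frames)
        visible_end = min(end + lookahead, num_frames)
        spans.append((start, end, visible_end))
    return spans

print(chunk_with_lookahead(10, chunk=4, lookahead=2))
# [(0, 4, 6), (4, 8, 10), (8, 10, 10)]
```

With a 10 ms hop, a lookahead of 2 frames adds about 20 ms of algorithmic latency on top of the chunk duration; larger lookahead buys accuracy at the cost of delay.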
Noise and channel mismatch. A model trained on headset microphone audio degrades severely when deployed on a distant microphone or a phone call with packet loss. Data augmentation (SpecAugment, room impulse response simulation) closes much of the gap but rarely all of it.
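The masking part of SpecAugment is easy to sketch: blank out a random band of time frames and a random band of frequency bins in the feature matrix. This simplified version applies one mask of each kind, omits time warping, and fills with 0.0 (which assumes mean-normalised features); the parameter names and defaults are illustrative.

```python
import random

def spec_augment(features, max_time_mask=10, max_freq_mask=8, rng=None):
    # features: list of T rows, each with F values. Returns a masked copy.
    rng = rng or random.Random()
    T, F = len(features), len(features[0])
    out = [row[:] for row in features]
    # Choose mask widths and start positions uniformly at random.
    t_width = rng.randint(0, min(max_time_mask, T))
    t_start = rng.randint(0, T - t_width)
    f_width = rng.randint(0, min(max_freq_mask, F))
    f_start = rng.randint(0, F - f_width)
    for t in range(T):
        for f in range(F):
            in_time_mask = t_start <= t < t_start + t_width
            in_freq_mask = f_start <= f < f_start + f_width
            if in_time_mask or in_freq_mask:
                out[t][f] = 0.0
    return out

clean = [[1.0] * 8 for _ in range(20)]  # toy (T=20, F=8) feature matrix
masked = spec_augment(clean, rng=random.Random(0))
print(sum(v == 0.0 for row in masked for v in row), "of", 20 * 8, "values masked")
```

Masks are redrawn for every training example, so the model rarely sees the same utterance with the same information missing twice.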
Hallucination in large models. Whisper-scale models can generate plausible-sounding but incorrect transcripts on silence, music, or very noisy audio. The decoder, conditioned on its own previous outputs, can enter a repetition loop or confabulate words. This is less of an issue in production systems that gate on voice activity detection.
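A voice-activity gate can be as simple as refusing to run the recogniser unless enough frames exceed an energy threshold. The sketch below is a deliberately crude energy detector with made-up thresholds; production systems typically use a trained VAD model, and an energy gate alone will not reject music or loud noise.

```python
import math

def frame_energy_db(frame):
    # Mean power of the frame in decibels relative to full scale (samples in [-1, 1]).
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)

def speech_frames(samples, win=200, hop=80, threshold_db=-40.0):
    # One boolean per frame: True when the frame is louder than the threshold.
    return [frame_energy_db(samples[start:start + win]) > threshold_db
            for start in range(0, len(samples) - win + 1, hop)]

def should_transcribe(samples, min_speech_frames=5, **kwargs):
    # Gate: only pass audio to the recogniser if enough frames look active.
    return sum(speech_frames(samples, **kwargs)) >= min_speech_frames

silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(1600)]
print(should_transcribe(silence), should_transcribe(tone))  # False True
```

Skipping the decoder on inactive audio removes the inputs on which repetition loops and confabulated text are most likely.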
Further Reading
- Conformer: Convolution-augmented Transformer for Speech Recognition (Gulati et al., 2020)
- Robust Speech Recognition via Large-Scale Weak Supervision - Whisper (Radford et al., 2022)
- Sequence Transduction with Recurrent Neural Networks - RNN-T (Graves, 2012)
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)