CTC: Connectionist Temporal Classification
CTC is a training objective that lets a neural network learn to align variable-length audio to text without any hand-labelled frame-level annotations.
Before CTC arrived in 2006, training a speech recogniser meant one thing: you needed someone to tell the model which phoneme was being spoken in every 10-millisecond frame. That alignment step required a bootstrapping HMM, domain-specific tooling, and weeks of engineering. Alex Graves and colleagues asked a simpler question: what if the network could figure out the alignment itself, given only the transcript?
That question produced Connectionist Temporal Classification, which nearly two decades later is still widely used in ASR systems, including work from Google, Baidu, and Microsoft.
The alignment problem, made concrete
Suppose you have a 1-second utterance sampled at 16 kHz with 80-dimensional log-mel features extracted every 10 ms. That gives you roughly 100 frames as input. The transcript is "cat" - three characters. You cannot simply pair frame 1 with "c", frame 2 with "a", frame 3 with "t" and call it a loss. The network has to output 100 symbols, and those symbols need to somehow decode to "cat".
CTC solves this by:
- Expanding the output alphabet with a special blank token (-).
- Allowing the network to emit any label or blank at every frame.
- Defining a many-to-one mapping that collapses repeated characters and strips blanks to recover the transcript.
So the sequence - c c - a - t t - collapses to cat. So do c - a t, c c a a t, and every other sequence that reduces to the same string. CTC computes the total probability of all frame-level sequences that collapse to the target, and maximises it.
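The collapse rule can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the names `collapse`, `alignments` and `BLANK` are made up for the example, and the brute-force enumeration is only practical for toy inputs.

```python
from itertools import groupby, product

BLANK = "-"  # illustrative blank symbol, matching the notation in the text


def collapse(path, blank=BLANK):
    """Merge runs of repeated symbols, then strip blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)


def alignments(target, num_frames, alphabet, blank=BLANK):
    """Brute-force list of every frame-level path that collapses to target."""
    symbols = list(alphabet) + [blank]
    return [
        "".join(path)
        for path in product(symbols, repeat=num_frames)
        if collapse(path, blank) == target
    ]


print(collapse("-cc-a-tt-"))              # cat
print(collapse("book"))                   # bok  (repeat merged: no blank between the o's)
print(collapse("bo-ok"))                  # book
print(len(alignments("cat", 4, "cat")))   # 7
```

With only four frames there are seven valid alignments for "cat": three that repeat one character and four that insert a single blank. The count grows very quickly with the number of frames, which is why the loss is computed with dynamic programming rather than by enumeration.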
The forward-backward algorithm and the loss
Let T be the number of input frames and L the label length. The set of valid alignments grows exponentially with T, but CTC never enumerates them one by one: it sums over all of them with a dynamic program similar to the HMM forward-backward algorithm.
Define the forward variable alpha(t, s) as the total probability of all partial alignments that, after t frames, have reached position s of the blank-extended target (blank, y_1, blank, y_2, ..., blank; length 2L + 1). The recursion is:
alpha(1, 1) = p(blank | x_1)
alpha(1, 2) = p(y_1 | x_1)
alpha(1, s) = 0 for s > 2
alpha(t, s) = [alpha(t-1, s) + alpha(t-1, s-1)] * p(label_s | x_t)
(plus alpha(t-1, s-2) inside the brackets if label_s != label_{s-2}, where label_s is the s-th symbol of the blank-extended target)
The total probability of the target is alpha(T, S) + alpha(T, S-1) with S = 2L + 1, because a valid alignment may end either on the trailing blank or on the last label. A symmetric backward variable beta(t, s) is computed right-to-left. The gradient with respect to the network outputs follows from alpha * beta products, so the whole objective is differentiable end-to-end.
Computational cost: O(T * L) time and space per utterance, linear in the number of frames for a fixed transcript length, and cheap enough to compute for every utterance in a training batch.
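As a rough sketch of the forward pass, the recursion can be written in plain Python. This version works in probability space for readability and assumes per-frame probabilities are given as nested lists; the function name `ctc_forward` is illustrative, and a practical implementation would work in log space to avoid underflow.

```python
def ctc_forward(probs, target, blank=0):
    """Return P(target | x) using the CTC forward recursion.

    probs  : list of T frames, each a list of per-class probabilities
    target : list of label indices, without blanks
    """
    # Blank-extended target: blank, y_1, blank, y_2, ..., blank (length 2L + 1)
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states = len(ext)

    # Initialisation: a path may start on the leading blank or the first label
    alpha = [0.0] * num_states
    alpha[0] = probs[0][blank]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for frame in probs[1:]:
        prev = alpha
        alpha = [0.0] * num_states
        for s in range(num_states):
            total = prev[s]                      # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]             # advance by one position
            if s >= 2 and ext[s] != ext[s - 2]:
                total += prev[s - 2]             # skip the blank between different labels
            alpha[s] = total * frame[ext[s]]

    # Valid alignments end on the trailing blank or on the last label
    return alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)


# Two frames, two classes (index 0 = blank, index 1 = "a")
frames = [[0.6, 0.4], [0.6, 0.4]]
print(ctc_forward(frames, [1]))      # paths "a-", "-a", "aa": 0.24 + 0.24 + 0.16
print(ctc_forward(frames, [1, 1]))   # 0.0: "aa" needs a blank in between, so at least 3 frames
```

The second call shows the repeated-label rule at work: with only two frames there is no room for the blank that must separate the two identical labels, so the target has zero probability.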
In PyTorch this is simply:
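import torch
import torch.nn as nn
T, N, C, S = 50, 4, 20, 10  # frames, batch size, classes incl. blank, max target length (placeholder values)
ctc_loss = nn.CTCLoss(blank=0, reduction='mean')
# log_probs: (T, N, C) - log-softmax outputs from the encoder (random stand-in here)
log_probs = torch.randn(T, N, C).log_softmax(dim=2)
# targets: (N, S) - padded transcripts; labels start at 1 because 0 is the blank
targets = torch.randint(1, C, (N, S), dtype=torch.long)
# input_lengths, target_lengths: actual lengths per sample
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(1, S + 1, (N,), dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)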
The blank=0 convention means index 0 in the vocabulary is reserved for the blank token.
What the network actually learns
CTC pushes the network into a characteristic output pattern. For most frames it emits a high-probability blank (silence, non-informative frames, inter-character gaps). At a few frames it spikes to a specific label. That spike is often highly localised - around the acoustic onset of a phoneme - even though the model was never told anything about phoneme boundaries.
This emergent behaviour has a name: peaky CTC. It makes decoding trivial (argmax over frames, then collapse). It also reflects a key assumption built into the objective:
CTC assumes output labels are conditionally independent given the input. The probability at frame t does not depend on what was emitted at frame t-1.
This conditional independence is baked into the recursion above, where each frame contributes a factor p(label_s | x_t) that ignores earlier emissions. As a result CTC has no explicit language model over its outputs: you either bolt on an external n-gram or neural language model at decode time, or you accept a weaker prior.
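Written out, the assumption and the resulting objective look roughly as follows. The symbols \pi (a frame-level path), \mathcal{B} (the collapse mapping) and y (the target transcript) are notation introduced here for illustration; they follow the usual presentation of CTC rather than anything defined earlier in this article.

```latex
% Probability of one frame-level path \pi = (\pi_1, \dots, \pi_T):
% a product of per-frame terms, with no dependence on earlier emissions
p(\pi \mid x) = \prod_{t=1}^{T} p(\pi_t \mid x_t)

% Probability of the transcript y: sum over every path that collapses to y
p(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} p(\pi \mid x),
\qquad
\mathcal{L}_{\mathrm{CTC}} = -\log p(y \mid x)
```

The product in the first line is where the independence assumption lives: no factor depends on \pi_{t-1}.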
Decoding: greedy vs. beam search
Greedy: argmax at each frame, then collapse. Fast, but suboptimal because the highest-probability frame sequence may not correspond to the highest-probability transcript.
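A minimal sketch of greedy decoding, assuming per-frame probabilities as nested lists with the blank at index 0 (the function name `greedy_decode` is illustrative):

```python
def greedy_decode(probs, blank=0):
    """Pick the most likely class per frame, merge repeats, drop blanks."""
    best_path = [max(range(len(frame)), key=lambda c: frame[c]) for frame in probs]

    decoded = []
    previous = None
    for label in best_path:
        # Emit a label only when it starts a new run and is not the blank
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Index 0 = blank, index 1 = "a"
print(greedy_decode([[0.1, 0.9], [0.1, 0.9], [0.8, 0.2], [0.1, 0.9]]))  # [1, 1]
print(greedy_decode([[0.6, 0.4], [0.6, 0.4]]))                          # []
```

The second call illustrates the suboptimality: the all-blank path has probability 0.36 and wins the per-frame argmax, yet the three paths that collapse to "a" together carry 0.24 + 0.24 + 0.16 = 0.64.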
Beam search with prefix scores: maintain a beam of partial transcripts and accumulate the probability of all alignments consistent with each prefix. This is the "prefix beam search" described in the Distill.pub guide by Hannun (2017). Integrating a language model is straightforward: at each beam extension multiply in the LM probability of the new token.
Score(transcript) = log P_acoustic(transcript | audio)
+ alpha * log P_lm(transcript)
+ beta * |transcript| # word insertion bonus
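The word insertion bonus offsets the language model's bias toward short transcripts: every additional word multiplies in another LM probability below 1, so without the bonus the decoder tends to drop words. Here alpha and beta are decoding weights, unrelated to the forward and backward variables above. Tuning them on a held-out set typically lowers word error rate by 1 to 3 percent relative.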
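The following is a simplified sketch of prefix beam search in probability space, followed by a rescoring step that applies the score above. Several things here are assumptions made for illustration: the function names, the `lm_log_prob` callable (a stand-in for whatever language model is available), and the use of label count in place of word count for the length term. A real decoder would work in log space and usually applies the LM during the search rather than afterwards.

```python
import math
from collections import defaultdict


def prefix_beam_search(probs, beam_size=8, blank=0):
    """Return [(prefix, probability)] with the most probable prefix first."""
    # For each prefix keep two masses: paths ending in blank, paths ending in a label
    beam = {(): (1.0, 0.0)}
    for frame in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_label) in beam.items():
            for c, p in enumerate(frame):
                if c == blank:
                    candidates[prefix][0] += (p_blank + p_label) * p
                elif prefix and prefix[-1] == c:
                    candidates[prefix][1] += p_label * p          # repeat merges into the same prefix
                    candidates[prefix + (c,)][1] += p_blank * p   # a blank came first: new label
                else:
                    candidates[prefix + (c,)][1] += (p_blank + p_label) * p
        ranked = sorted(candidates.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beam = {prefix: tuple(mass) for prefix, mass in ranked[:beam_size]}
    return [(prefix, p_blank + p_label) for prefix, (p_blank, p_label) in beam.items()]


def rescore(hyps, lm_log_prob, lm_weight=0.5, length_bonus=0.0):
    """Pick the hypothesis maximising acoustic + weighted LM + length bonus."""
    def score(hyp):
        prefix, prob = hyp
        acoustic = math.log(prob) if prob > 0 else float("-inf")
        return acoustic + lm_weight * lm_log_prob(prefix) + length_bonus * len(prefix)
    return max(hyps, key=score)


# Index 0 = blank, index 1 = "a"
hyps = prefix_beam_search([[0.6, 0.4], [0.6, 0.4]], beam_size=4)
best_prefix, best_prob = hyps[0]
print(best_prefix)  # (1,)
```

On this two-frame input the search ranks the prefix (1,) first with probability of about 0.64, because it accumulates all three alignments that collapse to it, whereas the per-frame argmax would pick the empty transcript.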
CTC in the modern ASR stack
CTC did not stay a standalone loss. Several architecture families have adopted it as one component:
| System | CTC role |
|---|---|
| Deep Speech (Hannun et al., 2014) | Primary loss on a 5-layer network with one bidirectional recurrent layer |
| wav2vec 2.0 (Baevski et al., 2020) | Fine-tuning loss on top of self-supervised representations |
| Conformer encoders in hybrid CTC/attention setups | Auxiliary CTC loss (joint with attention) to regularise encoder; the original Conformer (Gulati et al., 2020) was trained with a transducer loss |
| Whisper (Radford et al., 2022) | Not used; replaced by cross-attention seq2seq |
The hybrid CTC/attention result is instructive: adding a CTC auxiliary loss to a primary attention decoder is widely reported to improve convergence and final accuracy, even when the CTC branch is dropped at inference. A common explanation is that the monotonic alignment signal from CTC regularises the encoder towards more phonetically grounded representations.
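One common way to combine the two losses is a weighted sum. The weight \lambda below is a tunable hyperparameter introduced here as an assumption; the text above does not specify how the losses are mixed or what value is used.

```latex
% Joint objective: interpolate the CTC loss and the attention-decoder loss
\mathcal{L}_{\mathrm{joint}}
  = \lambda \, \mathcal{L}_{\mathrm{CTC}}
  + (1 - \lambda) \, \mathcal{L}_{\mathrm{att}},
\qquad 0 \le \lambda \le 1
```

Setting \lambda = 0 recovers a pure attention model, and \lambda = 1 a pure CTC model.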
When it falls down
Long sequences with repeated labels: the blank-between-repeats rule is necessary but can confuse the model when the same label genuinely repeats in rapid succession (e.g. the double letters in "bookkeeper"). The model must learn to emit a blank between the two identical characters, but the acoustic evidence for that blank can be vanishingly thin.
Conditional independence: because CTC cannot model the probability that "q" is followed by "u", it relies heavily on an external language model. On domain-shifted data where the LM does not apply well, the acoustic model alone can produce nonsensical outputs.
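Encoder output must be long enough: CTC requires T >= L (encoder output frames at least as many as output labels), plus one extra frame for every pair of adjacent identical labels, since each such pair needs a blank between them. With heavy subsampling (e.g. 8x convolutional stride), short utterances containing many characters can violate this constraint. No valid alignment then exists, so the target has zero probability and the loss is infinite; PyTorch's nn.CTCLoss offers zero_infinity=True to zero out such losses and their gradients.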
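A small pre-flight check along these lines can flag infeasible utterances before they reach the loss. The helper names are illustrative, and the 8x subsampling arithmetic below is a simplification that ignores padding and kernel-size effects.

```python
def min_frames_required(target):
    """Fewest frames that admit a valid CTC alignment for target."""
    # Each adjacent pair of identical labels needs one blank between them
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def is_feasible(num_frames, target):
    """True if at least one valid alignment exists."""
    return num_frames >= min_frames_required(target)


print(min_frames_required("cat"))            # 3
print(min_frames_required("bookkeeper"))     # 13: 10 letters + 3 doubled pairs
print(is_feasible(100 // 8, "bookkeeper"))   # False: 12 frames after 8x subsampling
```

Filtering or re-segmenting such utterances at data-loading time is one option; relying on zero_infinity=True is another, at the cost of silently ignoring those samples.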
Monotonic alignment only: CTC assumes the output sequence is a monotonic function of the input - no attending back. This is fine for speech (left-to-right by nature) but rules out translation tasks where word order differs between source and target.
Peaky outputs are brittle under noise: the spike-and-blank pattern is confident by design. Insertions or deletions caused by adversarial or mismatched acoustic conditions are not "softened" by a history of emissions, because the output distribution at each frame does not condition on earlier outputs. Sequence-to-sequence models with attention do condition on their output history, which can help them recover in such cases.