The CTC Blank Token and Alignment
The CTC blank token is a special output symbol that lets a neural network emit one label per time-step without needing a hand-crafted alignment between audio frames and characters.
A 10-second utterance sampled at 16 kHz produces 160,000 raw samples. After windowing and a filterbank with a typical 10 ms hop, you still get around 1,000 frames. A typical English sentence from that utterance might have 40 characters. You cannot just pair frame 1 with character 1 and train with cross-entropy: the lengths don't match, and the network has no idea which frames are silent, which are transitioning, and which are the "meat" of each phoneme. Before CTC, this problem required a separate forced-alignment step using a Hidden Markov Model, which added complexity and propagated errors.
Connectionist Temporal Classification (CTC), introduced by Graves, Fernandez, Gomez, and Schmidhuber at ICML 2006, dissolves this problem with a single insight: let the network output one symbol per frame from a vocabulary extended by one special symbol, the blank, and then marginalise over every possible alignment that decodes to the target sequence.
What the blank token does
Call the output vocabulary V. For a task like character-level transcription, V might be the set {a, b, c, ..., z}. CTC adds one more symbol, conventionally written _ or <blank>. At each time step t, the network emits a distribution over this extended vocabulary of size |V| + 1.
The raw frame-level output sequence is called a path. A path is collapsed to a label sequence by two rules applied in order:
- Merge repeated consecutive identical symbols.
- Remove all blank tokens.
So the path _ h h _ e e l _ l l o _ collapses to hello. So does h e l _ l o. So does _ _ h e l _ l o _. All three are valid alignments for the label sequence hello. The path h e l l o is not: its two adjacent l frames merge, giving helo.
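The two collapse rules can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `collapse` and the use of `_` as the blank are assumptions that follow the notation above.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol, matching the notation in the text


def collapse(path, blank=BLANK):
    """Collapse a frame-level CTC path into a label sequence."""
    # Rule 1: merge runs of identical consecutive symbols.
    merged = [symbol for symbol, _ in groupby(path)]
    # Rule 2: drop every blank that remains.
    return "".join(symbol for symbol in merged if symbol != blank)


print(collapse("_ h h _ e e l _ l l o _".split()))  # hello
print(collapse("_ _ h e l _ l o _".split()))        # hello
```

The order of the rules matters: removing blanks first would glue the two l runs together before merging and lose the double letter.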
This collapsing map is many-to-one. For a label sequence y of length L and an input of T frames, there can be an exponential number of valid paths. CTC training maximises the sum of probabilities across all of them, which it computes efficiently with a forward-backward dynamic programme analogous to the HMM forward-backward algorithm.
The blank token serves two distinct roles:
- Silence and transitions. Between phonemes, or during breath and pause, the network can emit blank rather than committing to a symbol prematurely.
- Repeated characters. Without blank you could not distinguish ll from l. Two consecutive l outputs collapse to one. To emit two l labels, the network must place a blank between them: l _ l.
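One consequence of the repeated-character rule is a lower bound on the number of frames: every label needs a frame, and every adjacent repeat needs one more for the separating blank. A small sketch (the helper name `min_frames` is hypothetical):

```python
def min_frames(labels):
    """Smallest number of frames for which a valid CTC path exists."""
    # One frame per label, plus one separating blank per adjacent repeat.
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + repeats


print(min_frames("hello"))  # 6, e.g. the path h e l _ l o
print(min_frames("help"))   # 4, e.g. the path h e l p
```

If the input has fewer frames than this bound, no path collapses to the target and its CTC probability is zero.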
The forward-backward computation
Let p(k, t) be the network's softmax probability for symbol k at frame t. The CTC loss is the negative log of:
P(y | x) = sum over all paths pi such that collapse(pi) = y of product_t p(pi_t, t)
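For very small inputs the sum can be evaluated literally, which is useful as a sanity check. The sketch below enumerates every path; the names `brute_force_ctc` and `collapse`, the convention that index 0 is the blank, and the uniform toy distribution are all assumptions made for illustration.

```python
from itertools import groupby, product


def collapse(path, blank=0):
    """Merge repeated symbols, then drop blanks."""
    return tuple(k for k, _ in groupby(path) if k != blank)


def brute_force_ctc(probs, target, blank=0):
    """P(target | x) by enumerating all K**T paths; tiny inputs only.

    probs[t][k] is the softmax probability of symbol k at frame t.
    """
    num_frames, num_symbols = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(num_symbols), repeat=num_frames):
        if collapse(path, blank) == tuple(target):
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total


# Toy input: 3 frames, symbols 0 = blank, 1 = 'a', 2 = 'b', uniform output.
probs = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]
# Five of the 27 paths collapse to "ab": aab, abb, _ab, a_b, ab_.
print(brute_force_ctc(probs, [1, 2]))  # about 0.185 (5/27)
```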
Computing this sum naively is intractable, because the number of paths grows exponentially with T. The standard approach builds an extended sequence y' of length 2L + 1 by inserting blanks between and around every label of y. The forward variable alpha(s, t) is the total probability of all paths through the first t frames that end on position s of y' and have emitted the labels of y' up to that position in order. The recurrence is:
alpha(s, t) = [alpha(s, t-1) + alpha(s-1, t-1)] * p(y'_s, t)
with an extra term alpha(s-2, t-1) inside the brackets when y'_s is a label (not blank) that differs from y'_{s-2}, which lets a path skip the blank between two different labels. The backward pass is symmetric. Multiplying forward and backward variables gives per-frame, per-label posteriors used to compute gradients. This runs in O(T * L) time and O(T * L) space, though the space can be reduced with checkpointing.
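The forward recursion can be sketched in plain Python as follows. This is an illustrative version that works directly with probabilities and uses 0-based positions; practical implementations work in log space to avoid underflow. The function name `ctc_forward` and the convention that index 0 is the blank are assumptions.

```python
def ctc_forward(probs, target, blank=0):
    """P(target | x) via the CTC forward recursion in O(T * L) time.

    probs[t][k] is the softmax probability of symbol k at frame t;
    target is a list of label indices.
    """
    # Extended sequence y': blanks between and around every label.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]        # a path may start on the blank ...
    if size > 1:
        alpha[1] = probs[0][ext[1]]    # ... or on the first label
    for row in probs[1:]:
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                    # stay on the same position
            if s >= 1:
                total += alpha[s - 1]           # advance by one position
            # Skip a blank, allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * row[ext[s]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Uniform toy input: 3 frames, symbols 0 = blank, 1 = 'a', 2 = 'b'.
probs = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]
print(ctc_forward(probs, [1, 2]))  # about 0.185 (5/27)
```

Only the previous frame's alpha values are kept here, which is enough for the forward probability; computing gradients also needs the backward variables.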
How alignment emerges without supervision
At the start of training the network outputs near-uniform distributions, so the gradient signal is weak and spread across all alignments. As training proceeds, the network gradually concentrates probability mass on paths that match the targets. A sharp peak emerges: the network learns to "spike" on the correct character at a specific frame and emit blank everywhere else.
You can visualise this by plotting the argmax output per frame. Early in training it looks like noise. After convergence it looks like a sparse sequence of character spikes separated by long runs of blank. This self-organised alignment is entirely learned from (audio, transcript) pairs with no frame-level labels.
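A text-only version of that plot is easy to produce. The sketch below prints one character per frame; the helper names are hypothetical and the posterior values are invented purely to show what a converged, spiky output looks like.

```python
def argmax_trace(probs, symbols):
    """One character per frame: the symbol with the highest probability."""
    return "".join(symbols[row.index(max(row))] for row in probs)


def blank_fraction(trace, blank="_"):
    """Share of frames whose argmax is the blank."""
    return trace.count(blank) / len(trace)


# Invented posteriors over the symbols (_, h, i), for illustration only.
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.95, 0.03, 0.02],
    [0.90, 0.05, 0.05],
    [0.20, 0.10, 0.70],
    [0.97, 0.02, 0.01],
]
trace = argmax_trace(probs, "_hi")
print(trace)                  # _h__i_
print(blank_fraction(trace))  # about 0.67
```

Tracking the blank fraction over training steps gives a rough, one-number view of how spiky the output has become.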
One practical consequence: CTC is monotonic. The blank-and-collapse mechanism enforces left-to-right order, so the network cannot emit the last character before the first. This is a feature for streaming inference (paired with a causal encoder, the model can emit symbols without waiting for future context) and a constraint that rules out tasks requiring reordering, such as translating between languages.
Decoding: greedy vs. beam search
Greedy decoding takes the argmax at each frame and collapses the result. It is fast, but it returns the label sequence of the single most probable path, which is not necessarily the most probable label sequence, because a label sequence's probability is the sum over all of its paths. Consider two frames where the model assigns 0.6 to blank and 0.4 to a at each frame. The most probable path is _ _ (0.6 * 0.6 = 0.36), so greedy decoding outputs the empty string. But the label sequence a is reached by three paths, a _, _ a, and a a, whose probabilities sum to 0.24 + 0.24 + 0.16 = 0.64, so a is the more probable output.
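A minimal sketch of greedy decoding, together with a two-frame toy input (blank 0.6, a 0.4 at each frame; values chosen for illustration) on which the argmax path and the most probable label sequence disagree. The function names and the convention that index 0 is the blank are assumptions; `label_probability` enumerates all paths and is only usable on tiny inputs.

```python
from itertools import groupby, product


def greedy_decode(probs, blank=0):
    """Argmax per frame, then merge repeats and drop blanks."""
    best = [row.index(max(row)) for row in probs]
    return [k for k, _ in groupby(best) if k != blank]


def label_probability(probs, target, blank=0):
    """Exact P(target | x) by enumerating every path (tiny inputs only)."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if [k for k, _ in groupby(path) if k != blank] == list(target):
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total


probs = [[0.6, 0.4], [0.6, 0.4]]      # index 0 = blank, 1 = 'a'
print(greedy_decode(probs))           # [] (the empty transcript)
print(label_probability(probs, []))   # about 0.36
print(label_probability(probs, [1]))  # about 0.64
```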
CTC beam search maintains a set of candidate label prefixes and accumulates probability from all paths consistent with each prefix. It naturally integrates an external language model by multiplying in n-gram or neural LM probabilities at each beam extension. In practice, beam search with a language model closes much of the gap between CTC accuracy and attention-based encoder-decoder accuracy on standard benchmarks.
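A subtlety: when a beam prefix ends in label c and the new frame also emits c, the outcome depends on how the path ended. If the path ended in c, the new c extends the same run and the prefix is unchanged; if it ended in blank, the new c is a second c and the prefix grows. Beam search therefore tracks, for every prefix, the probability of paths ending in blank separately from the probability of paths ending in a non-blank symbol. Without this split it cannot assign the probability mass to the correct prefix.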
When it falls down
Long dependencies across blanks. CTC assumes conditional independence between output labels given the encoder features. This means the model cannot score a label conditioned on what it already emitted. A language model applied during decoding patches this for surface n-gram patterns, but the encoder itself never learns "I just said 'k', so 'n' is likely next in knight". Attention-based models and RNN-T handle this more naturally.
Low blank probability causing spurious emissions. If the encoder is under-regularised or the training data has very dense transcripts (continuous speech with no pauses), the network may develop a high prior on non-blank tokens and emit characters too eagerly. This manifests as repeated or inserted characters in the output. Mitigations that are sometimes tried include upweighting blank in the loss or tuning the softmax temperature at inference. Note that temperature scaling leaves the per-frame argmax unchanged, so it affects beam search and language-model fusion but not greedy decoding.
Very long utterances. The forward-backward computation scales with T * L. For utterances longer than roughly 30 seconds, the intermediate activations can exhaust GPU memory. Practitioners often segment audio at forced boundaries (silence detection) before training and inference.
The peaky output problem and self-supervised fine-tuning. Models fine-tuned from self-supervised representations (e.g., wav2vec 2.0) sometimes produce very peaked CTC distributions that generalise poorly. Several papers have explored entropy regularisation and label smoothing to keep the distributions softer, at the cost of some greedy-decoding accuracy.
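Monotonicity is a hard constraint. Tasks that require non-monotonic alignment between input and output, such as translation with word reordering or certain cross-lingual transliteration patterns, cannot be handled by plain CTC. Script direction alone is not an obstacle: a right-to-left script is still emitted in a consistent order along the input. Attention-based encoder-decoder models do not share the monotonicity constraint; RNN-T, like CTC, aligns monotonically.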
Further reading
- Sequence Modeling with CTC - Distill.pub (Hannun, 2017) - the clearest visual walkthrough of alignment, the forward-backward algorithm, and beam search available online.
- Deep Speech: Scaling up end-to-end speech recognition (Hannun et al., 2014) - the paper that brought CTC to large-scale ASR and demonstrated competitive word error rates without pronunciation dictionaries.
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020) - shows CTC fine-tuning on top of a self-supervised encoder, achieving strong results with as few as 10 minutes of labelled audio.