The Conformer Architecture
The Conformer interleaves convolution and multi-head self-attention inside each encoder block to capture both fine-grained local acoustic patterns and long-range sequence dependencies. At its publication in 2020 it set state-of-the-art ASR accuracy on LibriSpeech.
intermediate · 8 min read
Attention is good at relating tokens across the entire sequence. Convolution is good at extracting local patterns from nearby frames. Before 2020, most ASR encoders leaned mainly on one or the other. The Conformer, introduced by Gulati et al. at Google, asks why you would choose.
On LibriSpeech test-clean / test-other, a Conformer-L with 118 M parameters achieved 2.1% / 4.3% WER without any external language model, and 1.9% / 3.9% WER with one. Those numbers improved on the pure-transformer and pure-convolutional baselines of the time. The key architectural move was surprisingly compact: build every encoder layer from a specific ordering of four sub-modules, each residually connected.
The Four-Module Block
A single Conformer encoder block applies sub-modules in this order:
x -> [Feed-Forward (half-step)] -> [Multi-Head Self-Attention] ->
[Convolution Module] -> [Feed-Forward (half-step)] -> LayerNorm -> x'
Each of the two feed-forward modules is added with a residual weight of 0.5, the half-step scheme borrowed from Macaron-Net, so together they contribute roughly one full feed-forward residual. Concretely, for input x:
x1 = x + 0.5 * FFN1(x)
x2 = x1 + MHSA(x1)
x3 = x2 + Conv(x2)
x4 = x3 + 0.5 * FFN2(x3)
out = LayerNorm(x4)
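The residual wiring above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the four sub-modules are passed in as arbitrary callables on a `[time][channels]` list, the LayerNorm has no learned gain or bias, and the names `conformer_block`, `add`, `layer_norm`, `zero`, and `identity` are hypothetical.

```python
import math
from typing import Callable, List

Sequence = List[List[float]]            # [time][channels]
Module = Callable[[Sequence], Sequence]


def add(a: Sequence, b: Sequence, scale: float = 1.0) -> Sequence:
    """Residual add: a + scale * b, elementwise."""
    return [[x + scale * y for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]


def layer_norm(seq: Sequence, eps: float = 1e-5) -> Sequence:
    """Normalise each frame over its channels (no learned gain or bias)."""
    out = []
    for frame in seq:
        mean = sum(frame) / len(frame)
        var = sum((v - mean) ** 2 for v in frame) / len(frame)
        out.append([(v - mean) / math.sqrt(var + eps) for v in frame])
    return out


def conformer_block(x: Sequence, ffn1: Module, mhsa: Module,
                    conv: Module, ffn2: Module) -> Sequence:
    x1 = add(x, ffn1(x), scale=0.5)     # half-step feed-forward
    x2 = add(x1, mhsa(x1))              # full-residual self-attention
    x3 = add(x2, conv(x2))              # full-residual convolution module
    x4 = add(x3, ffn2(x3), scale=0.5)   # second half-step feed-forward
    return layer_norm(x4)


# Toy stand-ins (assumptions) so the wiring can be run end to end.
zero: Module = lambda seq: [[0.0] * len(f) for f in seq]
identity: Module = lambda seq: [list(f) for f in seq]

x = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0]]
y = conformer_block(x, ffn1=identity, mhsa=zero, conv=zero, ffn2=identity)
```

Passing the sub-modules as callables keeps the sketch focused on the ordering and the 0.5 scaling; in a real encoder each one is a learned layer with its own pre-norm and dropout.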
The self-attention uses relative positional encodings (Transformer-XL style) rather than fixed absolute sinusoids. This matters for ASR because utterance lengths vary widely; relative encodings generalise better across lengths unseen at training time.
The Convolution Module in Detail
The convolution sub-module is the heart of what separates Conformer from a plain transformer encoder. It consists of:
- Pointwise conv (expansion): Projects channels from d_model to 2 * d_model.
- GLU activation: A Gated Linear Unit splits the 2 * d_model tensor along the channel axis and applies an elementwise gate, halving the channels back to d_model. This controls information flow before the depthwise step.
- Depthwise conv: A single depthwise convolution with kernel size k (typically 31 or 15). Each channel convolves independently, keeping the parameter count low.
- Batch normalisation: Applied after the depthwise conv. This is one of the few places Conformer uses BatchNorm rather than LayerNorm; the paper motivates it as an aid to training deep models.
- Swish activation.
- Pointwise conv (projection): Maps d_model to d_model channels, mixing information across channels after the per-channel depthwise step.
In the paper, the module is additionally preceded by a LayerNorm and followed by dropout.
| Sub-step | Operation | Role |
|---|---|---|
| Pointwise expand | 1x1 conv, d -> 2d | Increase capacity |
| GLU | Gated split | Selective gating |
| Depthwise conv | k-wide, groups=d | Local pattern capture |
| BatchNorm + Swish | Normalise, activate | Stability |
| Pointwise project | 1x1 conv, d -> d | Cross-channel mixing |
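The sub-steps in the table can be sketched in plain Python to make the data flow concrete. This is a rough illustration, not a reference implementation: it works on nested lists rather than tensors, it omits BatchNorm (which needs trained running statistics), and all function names and the toy weights are hypothetical.

```python
import math
from typing import List

Sequence = List[List[float]]  # [time][channels]


def pointwise(seq: Sequence, weight: List[List[float]]) -> Sequence:
    """1x1 conv: the same linear map on every frame; weight is [out][in]."""
    return [[sum(w * v for w, v in zip(row, frame)) for row in weight]
            for frame in seq]


def glu(seq: Sequence) -> Sequence:
    """Split channels in half; gate the first half with sigmoid(second half)."""
    out = []
    for frame in seq:
        d = len(frame) // 2
        out.append([a / (1.0 + math.exp(-b))
                    for a, b in zip(frame[:d], frame[d:])])
    return out


def depthwise_conv(seq: Sequence, kernels: List[List[float]]) -> Sequence:
    """Filter each channel along time with its own kernel (zero 'same' padding)."""
    T, d, k = len(seq), len(seq[0]), len(kernels[0])
    pad = k // 2
    out = [[0.0] * d for _ in range(T)]
    for t in range(T):
        for c in range(d):
            acc = 0.0
            for j in range(k):
                src = t + j - pad
                if 0 <= src < T:
                    acc += kernels[c][j] * seq[src][c]
            out[t][c] = acc
    return out


def swish(seq: Sequence) -> Sequence:
    return [[v / (1.0 + math.exp(-v)) for v in frame] for frame in seq]


def conv_module(seq, w_expand, dw_kernels, w_project) -> Sequence:
    h = pointwise(seq, w_expand)        # d -> 2d
    h = glu(h)                          # 2d -> d
    h = depthwise_conv(h, dw_kernels)   # d -> d, mixes along time only
    h = swish(h)                        # BatchNorm omitted in this sketch
    return pointwise(h, w_project)      # d -> d, mixes across channels


# Toy example with d = 2 channels and kernel size k = 3 (made-up weights).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w_expand = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-0.5, 0.5]]
dw_kernels = [[0.25, 0.5, 0.25], [0.0, 1.0, 0.0]]
w_project = [[1.0, 0.0], [0.0, 1.0]]
out = conv_module(frames, w_expand, dw_kernels, w_project)
```

Note that the depthwise step never mixes channels and the pointwise steps never mix time steps; the module gets both kinds of mixing by alternating them.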
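The depthwise kernel spans k consecutive encoder frames. With k=31 that is roughly 310 ms if the blocks see 10 ms frames; in the standard setup, where a convolutional front end subsamples the features 4x to a 40 ms frame rate, the same kernel covers roughly 1.24 s. Attention then relates those locally-enriched representations across the full sequence.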
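The local-context arithmetic is easy to check directly. The helper below is a hypothetical convenience function; it assumes the span is simply kernel size times the frame period seen by the encoder blocks, and it ignores the width of the analysis window. The 4x subsampling factor is an assumption matching the convolutional front end used in the original paper.

```python
def kernel_span_ms(kernel_size: int, frame_shift_ms: float,
                   subsampling: int = 1) -> float:
    """Approximate time covered by a depthwise kernel at the encoder frame rate."""
    return kernel_size * frame_shift_ms * subsampling


print(kernel_span_ms(31, 10.0))       # 310.0  (blocks see raw 10 ms frames)
print(kernel_span_ms(31, 10.0, 4))    # 1240.0 (after 4x subsampling)
print(kernel_span_ms(15, 10.0, 4))    # 600.0  (smaller kernel, 4x subsampling)
```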
Why the Ordering Matters
The original paper ablated different orderings. Placing convolution after attention, rather than before it, gave slightly lower WER. One plausible intuition is that attention can first align distant dependencies and reduce sequence-level ambiguity; convolution then refines local patterns on top of that contextualised representation. The reverse ordering forces convolution to process uncontextualised features, which is less effective when phonemes span variable-length acoustic realisations.
The Macaron-style feed-forward sandwich (half-step before and after attention + convolution) was also ablated. Replacing the two half-residual FF layers with a single full-residual FF layer degraded results slightly, which suggests the symmetric sandwich helps optimisation in the deeper stacks (up to 17 layers for Conformer-L).
Three model sizes were reported in the original paper:
| Variant | Params | d_model | Heads | Layers | test-clean WER (no LM) |
|---|---|---|---|---|---|
| Conformer-S | 10.3 M | 144 | 4 | 16 | 2.7% |
| Conformer-M | 30.7 M | 256 | 4 | 16 | 2.3% |
| Conformer-L | 118.8 M | 512 | 8 | 17 | 2.1% |
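To see how d_model and depth drive the sizes in the table, here is a back-of-the-envelope weight count per encoder block. It is only a rough sketch under loud assumptions: a feed-forward expansion factor of 4, a kernel size of 31, and no biases, normalisation parameters, positional projections, front end, or decoder. The function name is hypothetical.

```python
def encoder_block_weights(d_model: int, kernel_size: int = 31,
                          ffn_expansion: int = 4) -> int:
    """Rough weight count for one Conformer block (matrices and kernels only)."""
    ffn = 2 * d_model * (ffn_expansion * d_model)   # up + down projection
    attention = 4 * d_model * d_model               # Q, K, V, output projections
    conv = (d_model * 2 * d_model                   # pointwise expand, d -> 2d
            + kernel_size * d_model                 # depthwise, one kernel per channel
            + d_model * d_model)                    # pointwise project, d -> d
    return 2 * ffn + attention + conv               # two feed-forward modules


for name, d_model, layers in [("S", 144, 16), ("M", 256, 16), ("L", 512, 17)]:
    total = layers * encoder_block_weights(d_model)
    print(f"Conformer-{name}: ~{total / 1e6:.1f} M encoder-block weights")
```

The estimates come out below the reported totals, which would be consistent with the reported figures also covering the components this count leaves out. The quadratic terms dominate, so the depthwise kernel adds almost nothing to the budget.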
Integration with Sequence-to-Sequence Training
Conformer is an encoder architecture. It consumes log-mel filterbank features (typically 80 channels, 25 ms windows, 10 ms hop) and produces a sequence of contextualised frame-level embeddings. These embeddings can then feed a variety of decoding heads:
- CTC head: A linear projection to vocabulary logits followed by CTC loss. Simple, parallelisable during training, greedy-decodable.
- Attention decoder: A standard transformer decoder that attends to encoder outputs. Produces sharper posteriors but requires full-sequence encoding before decoding begins (not streaming-friendly).
- RNN-T head: An RNN transducer joiner that conditions on both encoder and prediction-network outputs. Streaming-compatible and the most common deployment choice for Conformer-based production systems.
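As an illustration of why the CTC head is called greedy-decodable, the sketch below implements best-path decoding: take the argmax per frame, merge consecutive repeats, and drop blanks. The function name, the blank index of 0, and the toy logits are assumptions made for the example.

```python
from typing import List, Sequence


def ctc_greedy_decode(logits: Sequence[Sequence[float]], blank: int = 0) -> List[int]:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    decoded: List[int] = []
    previous = blank
    for frame in logits:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != previous and best != blank:
            decoded.append(best)
        previous = best
    return decoded


# Six frames over a 3-symbol vocabulary (index 0 is the blank).
# Per-frame argmax is 1, 1, 0, 1, 2, 2.
toy_logits = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
print(ctc_greedy_decode(toy_logits))  # [1, 1, 2]
```

The blank between the two runs of symbol 1 is what lets CTC emit the same label twice in a row; without it the repeats would be merged into one.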
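The original paper trained its Conformer encoders with a transducer objective, in line with the RNN-T deployments noted above. Open-source toolkits vary: ESPnet recipes commonly pair the Conformer encoder with a hybrid CTC + attention-decoder objective, where CTC regularises the encoder to emit monotonically aligned features even though the attention decoder is the primary output path, while NeMo provides both CTC and transducer variants.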
When It Falls Down
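Long-form audio. The self-attention component scales quadratically with sequence length. An utterance of 30 seconds at a 10 ms frame shift produces 3,000 feature frames, or 750 encoder frames after the usual 4x subsampling, and the attention score matrices grow with the square of that count. Running a Conformer-L with full attention over recordings that last minutes rather than seconds is therefore memory-intensive. Streaming variants (chunk-based attention, Emformer-style memory, cache-based designs) address this at the cost of some context at chunk boundaries.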
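A quick estimate shows how fast the quadratic term grows. The sketch below counts only the attention score matrices for one layer and one utterance, assuming 8 heads, 4-byte floats, a 10 ms hop, and 4x subsampling before the encoder; activations, gradients, and batch size multiply this further. The function name is hypothetical.

```python
def attention_score_bytes(seconds: int, hop_ms: float = 10.0, subsampling: int = 4,
                          heads: int = 8, bytes_per_value: int = 4) -> int:
    """Bytes held by the (heads x T x T) attention scores of a single layer."""
    frames = int(seconds * 1000 / hop_ms) // subsampling
    return heads * frames * frames * bytes_per_value


for seconds in (30, 600):
    size_mb = attention_score_bytes(seconds) / 1e6
    print(f"{seconds:>4} s -> {size_mb:,.0f} MB of attention scores per layer")
```

Going from 30 seconds to 10 minutes multiplies the length by 20 and the score memory by 400, which is why long-form recordings are usually chunked.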
Noisy / reverberant conditions. Depthwise convolution captures local temporal patterns on top of mel features, but mel features themselves are not robust to reverberation. Conformer encoders trained on clean or lightly augmented data degrade noticeably in far-field microphone scenarios without SpecAugment or multi-condition training.
Low-resource languages. Conformer-L with 118 M parameters needs a substantial amount of labelled data to converge well. On languages with only a few hundred hours of transcribed speech, smaller architectures or self-supervised pretraining (wav2vec 2.0, WavLM) followed by fine-tuning will often outperform a Conformer trained from scratch.
BatchNorm and domain shift. The BatchNorm inside the convolution module accumulates running statistics over training batches and freezes them at inference. Small or padding-heavy training batches make those estimates noisy, and out-of-distribution acoustic conditions at inference can leave them mismatched. Replacing BatchNorm with LayerNorm or GroupNorm is a common adaptation in pipelines that serve heterogeneous audio sources.
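The mismatch can be reproduced with a toy single-channel example. This is a simplified sketch: real BatchNorm tracks an exponential moving average with learned scale and shift, whereas here the "running" statistics are just the mean and variance of a small made-up training set, and the shifted input stands in for out-of-distribution audio.

```python
import math
from typing import List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def batchnorm_eval(values: List[float], running_mean: float,
                   running_var: float, eps: float = 1e-5) -> List[float]:
    """Inference-mode BatchNorm for one channel: uses frozen statistics."""
    return [(v - running_mean) / math.sqrt(running_var + eps) for v in values]


# Statistics frozen from (made-up) training activations.
train = [-1.0, 0.0, 1.0, 0.5, -0.5]
running_mean = mean(train)
running_var = mean([(v - running_mean) ** 2 for v in train])

# In-domain input is centred; a louder, offset input is not.
in_domain = batchnorm_eval(train, running_mean, running_var)
shifted = batchnorm_eval([3.0 * v + 2.0 for v in train], running_mean, running_var)

print(round(mean(in_domain), 3), round(mean(shifted), 3))
```

A per-utterance normaliser such as LayerNorm recomputes its statistics from the input itself, which is why it is less sensitive to this kind of shift.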
Depthwise kernel rigidity. The 31-frame receptive field is fixed at training time. It works well for standard phoneme-level patterns but does not adapt dynamically to acoustic event duration the way attention does. On tasks with very short segments (keyword spotting at <100 ms budgets) the kernel often needs to shrink, reducing the architecture's advantage over pure-attention baselines.