Listen, Attend and Spell

Before 2015, building a competitive speech recogniser required assembling at least three separately built components: an acoustic model, a pronunciation lexicon, and a language model. Each had its own training objective, its own failure modes, and its own set of engineering decisions that did not necessarily align with the final goal of minimising word error rate. Listen, Attend and Spell (LAS), published by Chan, Jaitly, Le, and Vinyals in 2015, compressed all three into a single neural network trained with one objective: maximising the log-probability of the correct character sequence given the audio.

Architecture: listener, attender, speller

LAS has three named pieces that map cleanly onto encoder, attention, and decoder.

Listener (encoder). The input is a sequence of 40-dimensional log-mel filterbank features computed over 25 ms frames with a 10 ms stride. The listener is a pyramidal bidirectional LSTM (pBLSTM): above an ordinary BiLSTM bottom layer, each pyramidal layer halves the time resolution by concatenating adjacent hidden states from the layer below before processing them.

Layer 1 (BiLSTM):  T   frames  -> hidden states h_1..h_T
Layer 2 (pBLSTM):  T/2 states  -> (concat h_{2i}, h_{2i+1}) per step
Layer 3 (pBLSTM):  T/4 states  -> further halving
Layer 4 (pBLSTM):  T/8 states  -> further halving

Without this pyramid, the attention mechanism has to search over all T encoder states. A 10-second utterance at a 10 ms stride gives about 1000 steps, and the original paper reports that a listener without subsampling converges slowly. Three halvings reduce this to about 125 steps at the top of the listener.
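The concatenation step can be sketched in a few lines of NumPy. This is only an illustration of the reshaping, not the paper's implementation: the function name pyramid_concat and the feature width of 4 are made up for the example, the trailing odd frame is dropped by assumption, and the BiLSTM that sits between reductions in a real pBLSTM is omitted.

```python
import numpy as np

def pyramid_concat(h):
    """Halve the time axis by concatenating each pair of adjacent frames.

    h has shape (T, d); the result has shape (T // 2, 2 * d).
    A trailing odd frame is dropped (an assumption; padding would also work).
    """
    T, d = h.shape
    T_even = T - (T % 2)
    # Row i of the result is [h[2i], h[2i+1]] laid side by side.
    return h[:T_even].reshape(T_even // 2, 2 * d)

# About 1000 frames for 10 s of audio at a 10 ms stride.
h = np.random.default_rng(0).standard_normal((1000, 4))
lengths = [h.shape[0]]
for _ in range(3):  # three successive halvings
    h = pyramid_concat(h)
    lengths.append(h.shape[0])
print(lengths)  # [1000, 500, 250, 125]
```

In a real listener the feature width does not keep doubling, because each recurrent layer maps the concatenated pair back to its own hidden size.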

Attender (content-based attention). At each decoder step i, the attention mechanism computes a scalar energy e_{i,u} for every encoder state h_u, then softmaxes them into weights alpha_{i,u}:

e_{i,u}  = <v, tanh(W * s_i  +  V * h_u  +  b)>
alpha_i  = softmax(e_i)
c_i      = sum_u  alpha_{i,u} * h_u

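Here s_i is the decoder hidden state at step i and c_i is the context vector passed into the speller. This is the additive (Bahdanau-style) form, which is common in reimplementations; the original paper scores each pair with an inner product of two learned projections, e_{i,u} = <phi(s_i), psi(h_u)>, where phi and psi are small MLPs. In either form, LAS attends over the compressed encoder output rather than over raw frames.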
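The three equations above translate almost line for line into NumPy. The sketch below assumes the additive form exactly as written; the function name and the dimension names in the docstring are illustrative, and the parameters would be learned in a real model rather than passed in.

```python
import numpy as np

def additive_attention(s_i, H, W, V, b, v):
    """One attention step over all encoder states.

    s_i: decoder state, shape (d_s,)
    H:   encoder states h_1..h_U stacked, shape (U, d_h)
    W:   shape (d_a, d_s);  V: shape (d_a, d_h);  b, v: shape (d_a,)
    Returns (alpha, c): weights of shape (U,) and context of shape (d_h,).
    """
    # e_u = <v, tanh(W s_i + V h_u + b)> for every u at once.
    e = np.tanh(W @ s_i + H @ V.T + b) @ v
    # Softmax, shifted by the maximum for numerical stability.
    w = np.exp(e - e.max())
    alpha = w / w.sum()
    # c = sum_u alpha_u * h_u
    c = alpha @ H
    return alpha, c
```

The weights alpha are non-negative and sum to one, so c is always a convex combination of encoder states.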

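Speller (decoder). A two-layer LSTM decoder autoregressively generates one character per step. The decoder state s_i is updated from the previous state, the embedding of the previous character y_{i-1}, and the previous context c_{i-1}; attention then uses s_i to compute c_i, and the output distribution is predicted from s_i and c_i together. During training, teacher forcing feeds the ground-truth previous character; during inference, beam search over the character vocabulary selects the output. The output is a distribution over roughly 40 symbols: 26 lowercase letters, 10 digits, space, a few punctuation marks such as the apostrophe, and special tokens including end-of-sequence.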

The full training objective is simply:

L = -sum_i  log P(y_i | y_{<i}, x)

No CTC auxiliary loss, no alignment supervision.
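As a concrete reading of the objective, the sketch below computes the loss for one utterance from the per-step scores the speller would produce under teacher forcing. The function name and the array layout are assumptions for the example; a real training loop would use a framework's cross-entropy and average over a batch.

```python
import numpy as np

def sequence_nll(logits, targets):
    """Negative log-likelihood of one target character sequence.

    logits:  shape (n_steps, vocab), unnormalised scores at each decoder step
    targets: shape (n_steps,), integer id of the correct character per step
    """
    # Log-softmax per step, shifted by the row maximum for stability.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Pick out log P(y_i | y_<i, x) for the correct character and sum.
    steps = np.arange(len(targets))
    return -log_probs[steps, targets].sum()
```

With uniform scores over a vocabulary of size V, each step contributes log V, which is a handy sanity check for an untrained model.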

Why pyramidal subsampling matters

A flat BiLSTM encoder without subsampling would feed all T encoder states into the attention mechanism. The decoder must learn to focus on the right 10-20 consecutive frames for each output character. With a long, dense sequence, the softmax over all T states spreads probability mass thinly, and gradients from misaligned attention steps are noisy. The pyramid forces the model to build coarser acoustic representations, making the alignment task tractable.

This also drastically reduces computation. Attention in LAS is O(T' * L) where T' is the subsampled length and L is the output length. With T' around 125 and L around 50 characters for a typical utterance, the attention matrix is small enough to compute exactly without approximation.
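The saving is easy to quantify. The helper below (a hypothetical name, using the round numbers from the text) counts how many attention energies are computed for one utterance with and without three halvings.

```python
def attention_scores(num_frames, num_chars, num_halvings):
    """Return (subsampled length, number of energies e_{i,u} per utterance)."""
    t_sub = num_frames // (2 ** num_halvings)
    return t_sub, t_sub * num_chars

print(attention_scores(1000, 50, 3))  # (125, 6250)
print(attention_scores(1000, 50, 0))  # (1000, 50000)
```

Under these assumptions the pyramid cuts the attention work by a factor of eight.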

Training details and results

The original paper trained on Google's proprietary voice search corpus with roughly 3 million utterances. Training used scheduled sampling: with probability p the decoder was fed its own previous prediction rather than the ground-truth character, gradually building robustness to its own errors. Without this, the model becomes brittle at inference time because the decoder has only ever seen gold character prefixes during training.
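The choice made at each decoder step can be sketched as follows. This is an assumption-laden illustration rather than the paper's code: the function name is made up, and the model's own prediction is drawn by sampling from its previous output distribution (taking the argmax is another common choice).

```python
import numpy as np

def next_decoder_input(gold_prev, probs_prev, p_sample, rng):
    """Choose the character id fed to the decoder at the next step.

    gold_prev:  ground-truth previous character id
    probs_prev: model's output distribution at the previous step, shape (vocab,)
    p_sample:   probability of using the model's own prediction
    rng:        a numpy.random.Generator
    """
    if rng.random() < p_sample:
        # Feed back a character drawn from the model's own distribution.
        return int(rng.choice(len(probs_prev), p=probs_prev))
    # Otherwise fall back to ordinary teacher forcing.
    return int(gold_prev)
```

Setting p_sample to 0 recovers pure teacher forcing, and schedules typically start there and raise the probability as training progresses.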

On the Google voice search test set, LAS achieved:

Condition	WER (%)
LAS + scheduled sampling, no LM	14.1
LAS + scheduled sampling + language model rescoring	10.3
State-of-the-art CLDNN-HMM baseline (2015)	8.0

The gap to a well-tuned HMM-DNN pipeline narrowed substantially once more data was added. Later work in the same paradigm, particularly with attention smoothing and better training recipes, closed it entirely. The significance of LAS was not an immediately lower error rate; it was the elimination of the pronunciation lexicon and the multi-stage pipeline.

Attention as alignment

One underappreciated benefit of content-based attention is that the attention weights alpha_{i,u} provide a soft alignment between output characters and encoder frames. You can visualise this matrix for any utterance and see (roughly) which acoustic region each character attends to. This is not forced monotonic alignment; the model is free to attend to any frame at any step. In practice, the learned attention is nearly monotonic for English speech, but the model can attend backward, which sometimes helps with coarticulation and fricatives.
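Given the attention matrix for one utterance, a rough hard alignment and a monotonicity check take only a few lines. The function names are hypothetical, and taking the argmax of each row is just one simple way to summarise a soft alignment.

```python
import numpy as np

def hard_alignment(alpha):
    """alpha: (n_chars, n_enc) attention weights, one row per output character.

    Returns the most-attended encoder index for each character.
    """
    return alpha.argmax(axis=1)

def monotonic_fraction(alignment):
    """Fraction of consecutive steps where attention does not move backward."""
    if len(alignment) < 2:
        return 1.0
    return float((np.diff(alignment) >= 0).mean())

# Toy matrix: three characters over four encoder states.
alpha = np.array([
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.1, 0.6, 0.2],
    [0.0, 0.5, 0.3, 0.2],   # attends backward relative to the previous step
])
path = hard_alignment(alpha)
print(path.tolist(), monotonic_fraction(path))  # [0, 2, 1] 0.5
```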

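Compare this to CTC, which assumes conditional independence between output tokens given the input, restricts alignments to be monotonic, and marginalises over all valid alignments. Because each frame's label is predicted without reference to the other labels, CTC cannot use the characters it has already emitted to disambiguate an ambiguous frame; any language-model behaviour has to come from an external LM. LAS has no such constraint: each character is conditioned on all previous characters as well as on the audio.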
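To make "valid alignments" concrete: a CTC alignment is a frame-level label path that reduces to the target after merging repeated labels and deleting blanks. A minimal sketch of that reduction, with "_" chosen here as the blank symbol:

```python
from itertools import groupby

def ctc_collapse(path, blank="_"):
    """Collapse a frame-level CTC path: merge repeats, then drop blanks."""
    merged = [label for label, _ in groupby(path)]
    return "".join(label for label in merged if label != blank)

print(ctc_collapse("hh_e_ll_lo"))  # hello
print(ctc_collapse("_hel_lo__"))   # hello
print(ctc_collapse("hello"))       # helo (no blank between the two l's)
```

CTC sums the probability of every path that collapses to the target, whereas LAS never enumerates paths and lets attention pick the relevant frames at each step.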

When it falls down

Streaming is not natural. The attention mechanism needs access to the full encoder output before it can attend over it. The pBLSTM is bidirectional, so it also requires the full utterance. Naive LAS cannot produce partial hypotheses as audio arrives. Various patches exist (monotonic attention, online decoding with look-ahead), but they sacrifice either accuracy or simplicity.

Silence and long pauses break attention. If an utterance contains several seconds of silence, the encoder produces many near-identical representations of silence. The decoder's attention mechanism must still scan over these frames, and it can get stuck attending to the wrong silence region, generating repeated characters or skipping words.

Data hunger. Without a pronunciation lexicon to bootstrap from, LAS must learn phoneme-to-character correspondences entirely from data. On low-resource languages or domain-specific vocabulary (medical terms, proper nouns), this requires substantially more paired data than a hybrid HMM-DNN system that can leverage a separately constructed lexicon.

Long utterances degrade. Attention quality tends to degrade on utterances longer than those seen during training. If the training corpus is mostly short voice queries (under 10 seconds), the model generalises poorly to longer dictation. Attention is also expensive at inference on long inputs: its O(T' * L) cost grows roughly quadratically with utterance duration, because the encoder length and the output length both grow with it.

No explicit language model. Beam search scores hypotheses with the speller alone. The character-level decoder implicitly learns some language model behaviour, but it is much weaker than an explicit word-level LM trained on a large text corpus. On domains with unusual vocabulary, rescoring with an external LM is effectively mandatory.
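Rescoring an n-best list can be sketched as below. The scoring rule, a length-normalised LAS log-probability plus a weighted LM log-probability, is one common choice; the function name, the weight, and the example numbers are all illustrative assumptions, not values from the paper.

```python
def rescore(hypotheses, lm_weight):
    """Re-rank beam search output with an external language model.

    hypotheses: list of (text, las_log_prob, lm_log_prob) tuples
    lm_weight:  interpolation weight for the LM score (tuned on held-out data)
    Returns the hypotheses sorted best first.
    """
    def score(hyp):
        text, las_log_prob, lm_log_prob = hyp
        # Normalise by length so longer hypotheses are not penalised unfairly.
        return las_log_prob / max(len(text), 1) + lm_weight * lm_log_prob
    return sorted(hypotheses, key=score, reverse=True)

# Made-up scores: LAS slightly prefers the first spelling, the LM the second.
beam = [("call mum", -4.0, -12.0), ("call mom", -4.4, -6.0)]
print(rescore(beam, 0.0)[0][0])  # call mum
print(rescore(beam, 0.1)[0][0])  # call mom
```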
