Why Sequence Length Is Hard in Audio

One second of speech sampled at 16 kHz gives you 16,000 raw samples. After a short-time Fourier transform with a 25 ms window and a 10 ms hop, followed by a mel filterbank and a log, that same second becomes about 100 frames of 80-dimensional log-mel features. A two-minute recording therefore produces roughly 12,000 frames. The transcript might be 300 words, or perhaps 400 subword tokens (as characters it would be well over a thousand). You are trying to align a sequence of length 12,000 with a sequence of length 400. That ratio, about 30 to 1, is the central engineering problem in automatic speech recognition (ASR).
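The frame counts above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes no padding at the edges of the signal; the function name and defaults are illustrative, and real feature extractors often pad, which shifts the count by a frame or two.

```python
def num_frames(duration_s, sample_rate=16_000, win_ms=25, hop_ms=10):
    """Number of analysis frames from a sliding window with no edge padding."""
    n_samples = int(duration_s * sample_rate)
    win = sample_rate * win_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


one_second = num_frames(1.0)      # 98, i.e. "about 100"
two_minutes = num_frames(120.0)   # 11998, i.e. "roughly 12,000"
print(one_second, two_minutes, round(two_minutes / 400, 1))
```

Dividing the two-minute frame count by a target length of 400 gives a ratio of about 30 input frames per output unit.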

The length mismatch and why naive alignment fails

In machine translation you map a sentence of ~20 tokens to another sentence of ~20 tokens. The sequences are roughly the same length, so attention is tractable and cross-entropy training is straightforward. In ASR the input is 10 to 50 times longer than the output. Several consequences follow immediately.

Quadratic attention cost. Standard self-attention scales as O(T^2) in both time and memory, where T is the input sequence length. At T = 1200 frames (a 12-second utterance), the attention matrix alone occupies ~1200^2 = 1.44 million entries per head. With 8 heads and float32 that is already 46 MB for one example in one layer. Full-context attention over long-form audio is therefore expensive, not merely inconvenient.
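To see how quickly this grows, the following sketch reproduces the 46 MB figure and extends it to longer inputs. It counts only the attention score matrices for one example in one layer, assumes 100 frames per second with no subsampling, and ignores activations, gradients and any memory-efficient attention kernel.

```python
def attention_matrix_bytes(num_frames, num_heads=8, bytes_per_value=4):
    """Bytes needed to hold the T x T attention scores for every head."""
    return num_frames ** 2 * num_heads * bytes_per_value


for seconds in (12, 60, 120):
    frames = seconds * 100  # 10 ms hop, no subsampling
    megabytes = attention_matrix_bytes(frames) / 1e6
    print(f"{seconds:>4} s  {frames:>6} frames  {megabytes:>8.1f} MB")
```

Going from 12 seconds to two minutes multiplies the input length by 10 and the attention memory by 100.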

Unknown alignment. You do not know, for each output character, which input frame it corresponds to. A human annotator could locate "the k in 'truck'" to within a few frames, but that annotation is expensive and does not generalise. Every major ASR architecture is really an answer to the question: "how do we handle an unknown, monotone alignment between a long input and a short output?"

Variable length across the batch. Utterances in a training batch range from under one second to twenty seconds or more. Padding shorter sequences to match the longest one wastes computation and memory, but the alternatives (bucketing, dynamic batching) add engineering complexity.
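A small sketch makes the padding cost concrete. The lengths below are made-up frame counts, and `bucket_by_length` is a hypothetical helper showing the simplest form of bucketing (sort, then slice); production data loaders usually add shuffling within and across buckets.

```python
def padded_fraction(batch_lengths):
    """Fraction of a padded batch that is padding rather than real frames."""
    longest = max(batch_lengths)
    return 1 - sum(batch_lengths) / (longest * len(batch_lengths))


def bucket_by_length(lengths, batch_size):
    """Sort by length, then cut into consecutive batches of similar length."""
    ordered = sorted(lengths)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


lengths = [80, 2000, 150, 1900, 100, 1800, 120, 2100]  # frames, illustrative only
naive = [lengths[:4], lengths[4:]]          # batches in arrival order
bucketed = bucket_by_length(lengths, 4)     # batches of similar length

for name, batches in (("naive", naive), ("bucketed", bucketed)):
    print(name, [round(padded_fraction(b), 2) for b in batches])
```

In this toy case roughly half of each naive batch is padding, while the bucketed batches waste a quarter or less.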

How CTC sidesteps the alignment problem

Connectionist Temporal Classification (CTC), introduced by Graves et al. in 2006 and scaled industrially by Deep Speech (Hannun et al., 2014), solves alignment by marginalising over all possible alignments at training time. The model emits a probability distribution over (vocabulary + blank) at every frame, and the CTC loss sums the probability of all frame sequences that collapse to the target transcript after merging repeated tokens and then removing blanks. The order matters: a blank between two identical tokens is what keeps them from being merged into one.
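The collapse rule and the sum over alignments can be illustrated on a toy example. The sketch below enumerates every frame-level path by brute force, which is exponential in the number of frames and only useful for building intuition; real CTC losses compute the same quantity with a forward-backward recursion. The blank symbol `_` and the function names are choices made for this example.

```python
from itertools import product

BLANK = "_"


def ctc_collapse(path):
    """Merge runs of identical symbols, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)


def ctc_prob_bruteforce(frame_probs, target):
    """Sum the probability of every path that collapses to `target`.

    frame_probs is a list with one dict per frame, mapping symbol -> probability.
    """
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if ctc_collapse(path) == target:
            p = 1.0
            for t, s in enumerate(path):
                p *= frame_probs[t][s]
            total += p
    return total


uniform = [{"a": 1 / 3, "b": 1 / 3, BLANK: 1 / 3}] * 3
print(ctc_collapse("aa_ab_"))              # aab
print(ctc_prob_bruteforce(uniform, "ab"))  # 5 of the 27 paths collapse to "ab"
```

With three frames and a uniform distribution, the five paths `aab`, `abb`, `ab_`, `a_b` and `_ab` all collapse to `ab`, so the marginal probability is 5/27.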

The key constraint CTC imposes is conditional independence: the output probability at frame t depends only on the encoder state at t, not on previously emitted tokens. This makes the forward-backward algorithm factorisable and tractable, but it limits how much of a language model the network can learn internally, because the output layer cannot condition on what has already been emitted. CTC outputs are therefore usually combined with an external n-gram or neural language model at decode time.

CTC also requires T >= L: the input must be at least as long as the output, plus one extra frame for every pair of identical adjacent tokens, because a blank has to separate them. This is nearly always satisfied for audio, but it explains why CTC with character-level outputs can fail when the audio is very short or heavily downsampled.
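The length constraint is easy to check before training. This sketch counts one frame per target token plus one blank for every pair of identical neighbours; the helper names and the 12-frame example are illustrative assumptions rather than part of any particular toolkit.

```python
def min_ctc_frames(target):
    """Smallest number of frames from which CTC can emit `target`."""
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def ctc_feasible(num_encoder_frames, target):
    """True if at least one valid CTC alignment exists."""
    return num_encoder_frames >= min_ctc_frames(target)


print(min_ctc_frames("truck"))                # 5
print(min_ctc_frames("hello"))                # 6, the double "l" needs a blank
print(ctc_feasible(12, "hello world"))        # True, exactly 12 frames needed
print(ctc_feasible(12, "the truck stopped"))  # False, 18 frames needed
```

An utterance that fails this check has zero probability under CTC, which typically shows up as an infinite or NaN loss, so filtering such examples out of the training set is a common precaution.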

How RNN-T handles streaming and context

The RNN Transducer (Graves, 2012) extends CTC by adding a separate recurrent prediction network that conditions each output token on previously emitted tokens. The joint network combines encoder state at frame t with prediction network state after k emissions.

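The alignment lattice is now a T x K grid, where K is the number of target tokens, rather than a single axis of length T, and the joint network's output over it is correspondingly larger than CTC's T x |vocab| matrix. The monotone constraint means you only ever move right (advance a frame) or up (emit a token), so dynamic programming over this lattice remains O(T * K).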
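The dynamic programme over the lattice can be sketched in plain Python. This version works with raw probabilities for readability and assumes the joint network's outputs have already been reduced to two tables, one for the blank and one for the correct next token at each node; real implementations work in log space on the full output tensor. The table layout and function name are assumptions made for this example.

```python
def rnnt_forward(blank, emit):
    """Total probability of the target under an RNN-T lattice.

    blank[t][k]: probability of blank (move right, next frame) at node (t, k).
    emit[t][k]:  probability of target token k+1 (move up) at node (t, k).
    Both tables have T rows and K + 1 columns; emit[t][K] is never read.
    """
    T, K = len(blank), len(blank[0]) - 1
    alpha = [[0.0] * (K + 1) for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for k in range(K + 1):
            if t > 0:  # arrived by consuming a frame
                alpha[t][k] += alpha[t - 1][k] * blank[t - 1][k]
            if k > 0:  # arrived by emitting a token
                alpha[t][k] += alpha[t][k - 1] * emit[t][k - 1]
    # Every complete path leaves the top-right node with a final blank.
    return alpha[T - 1][K] * blank[T - 1][K]


T, K = 3, 2
half = [[0.5] * (K + 1) for _ in range(T)]
print(rnnt_forward(half, half))  # 0.1875 = 6 paths x 0.5**5
```

The two nested loops visit each of the T x (K + 1) nodes once, which is where the O(T * K) cost comes from.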

Because the prediction network runs left to right and the joint network needs only the encoder state for the current frame, RNN-T suits streaming: with a causal or limited-look-ahead encoder you can emit tokens as frames arrive without waiting for the whole utterance. This is why RNN-T became the de facto standard for on-device and latency-sensitive ASR (Google's streaming recogniser, Apple's Siri, etc.).

The cost is training complexity: the full-sequence loss requires materialising a T x K x |vocab| tensor of joint-network outputs for every utterance, which is large and in practice is computed on GPU with specialised kernels (for example warp-transducer or torchaudio.transforms.RNNTLoss).
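A rough size estimate shows why this tensor dominates training memory. The batch size, frame count, target length and vocabulary size below are hypothetical values chosen for illustration, and the estimate covers only the forward tensor, not its gradient.

```python
def rnnt_logits_bytes(batch, frames, tokens, vocab, bytes_per_value=4):
    """Bytes for the joint-network output: batch x T x (K + 1) x |vocab|."""
    return batch * frames * (tokens + 1) * vocab * bytes_per_value


# Hypothetical setup: 16 utterances of 10 s after 4x subsampling (250 frames),
# 50 subword targets each, and a 1024-entry vocabulary.
size = rnnt_logits_bytes(batch=16, frames=250, tokens=50, vocab=1024)
print(f"{size / 1e9:.2f} GB")  # 0.84 GB
```

The same batch under CTC would need only batch x T x |vocab| values, about 51 times fewer in this example.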

How the Conformer manages long local context

Attention sees the whole sequence but is weak at fine-grained local patterns; convolution is strong at local patterns but has limited receptive field. In speech, both matter: phonetic detail is local (a few frames), prosody and word identity are global (hundreds of frames).

The Conformer (Gulati et al., 2020) stacks, per layer:

A half-step feed-forward module
Multi-head self-attention (global context)
A depthwise convolution module (local context, kernel size ~31 frames)
Another half-step feed-forward module

The convolution module handles the local phonetic structure cheaply while attention handles longer dependencies. Gulati et al. report that this combination outperforms the pure Transformer and RNN baselines they compare against on LibriSpeech, without requiring longer sequences or heavier attention.
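The wiring of the four modules can be sketched without any deep-learning library. In the code below the sublayers are toy stand-ins operating on a list of one number per frame: a global mean plays the role of attention and a short moving average plays the role of the depthwise convolution. Only the residual structure, with its two half-weighted feed-forward steps, reflects the block described above; the final normalisation follows my reading of the Conformer paper and is passed in as a plain function here.

```python
def add(x, y, scale=1.0):
    """Residual connection: x + scale * y, frame by frame."""
    return [xi + scale * yi for xi, yi in zip(x, y)]


def conformer_block(x, ffn1, attention, conv, ffn2, layer_norm):
    """Residual wiring of one block; every sublayer maps a list to a list."""
    x = add(x, ffn1(x), 0.5)   # half-step feed-forward
    x = add(x, attention(x))   # global context
    x = add(x, conv(x))        # local context
    x = add(x, ffn2(x), 0.5)   # second half-step feed-forward
    return layer_norm(x)


def global_mean(x):
    """Toy stand-in for attention: every frame sees every other frame."""
    mean = sum(x) / len(x)
    return [mean] * len(x)


def moving_average(x, kernel=3):
    """Toy stand-in for the convolution: each frame sees only its neighbours."""
    half = kernel // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def zero(x):
    return [0.0] * len(x)


def identity(x):
    return x


frames = [0.0, 0.0, 9.0, 0.0, 0.0, 0.0]
print(conformer_block(frames, zero, global_mean, moving_average, zero, identity))
```

With a spike at one frame, the moving average spreads it only to the immediate neighbours while the global mean shifts every frame equally, which mirrors the local and global roles of the two modules.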

The architectural insight is that "sequence length is hard" partly because no single operation type handles the full range of relevant timescales in speech.

Whisper's fixed-window approach

Whisper (Radford et al., 2022) takes a different route: rather than learning to handle variable lengths, it standardises them. Every input is padded or trimmed to exactly 30 seconds, which is 3000 log-mel frames at a 10 ms hop and 1500 encoder positions after the 2x downsampling in the convolutional stem. The encoder therefore always sees a fixed-size input, and the decoder cross-attends to a fixed-length memory.

This works surprisingly well for transcription but has two consequences worth knowing:

Utterances shorter than 30 seconds carry substantial padding. The audio is zero-padded, and the model learns to ignore the padded region, but the encoder still spends computation and capacity on it.
Utterances longer than 30 seconds must be split into chunks with a sliding window, and the model uses timestamps to stitch segments. Errors accumulate at chunk boundaries, particularly for music or noisy speech where the timestamp predictor can drift.
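Both cases reduce to the same pad-or-trim operation on the raw samples. The sketch below is a simplified illustration that assumes 16 kHz audio held in a plain Python list and cuts long recordings at fixed 30-second boundaries; it does not reproduce the timestamp-based window shifting described above, and the function names are this example's own.

```python
SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples


def pad_or_trim(samples, length=CHUNK_SAMPLES):
    """Return exactly `length` samples: cut the tail or append zeros."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


def fixed_chunks(samples, length=CHUNK_SAMPLES):
    """Split into consecutive windows, zero-padding the last one."""
    return [pad_or_trim(samples[start:start + length], length)
            for start in range(0, len(samples), length)]


five_seconds = [0.1] * (5 * SAMPLE_RATE)
padded = pad_or_trim(five_seconds)
print(len(padded), 1 - len(five_seconds) / len(padded))  # 480000 samples, 5/6 padding
```

A five-second utterance fills only one sixth of the window, so five sixths of the encoder input is padding.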

The fixed-window approach trades streaming capability for training simplicity. As released, Whisper is not designed for low-latency streaming: the encoder expects a full 30-second window, so it is essentially a batch transcription system.

Where these approaches fall down

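Very long-form audio. Even with chunking, models like Whisper accumulate errors over 30+ minute recordings. The encoder's attention context resets at each chunk boundary, so discourse-level context (speaker names, topic continuity) is lost unless it is carried over some other way, for example by feeding the previous chunk's text to the decoder.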

High-frame-rate inputs without subsampling. If you feed log-mel frames at the full 10 ms rate (100 frames per second) without subsampling, a 10-second clip is 1000 frames. Conformer and Transformer models typically apply 4x or 8x convolutional subsampling in the frontend to bring this down to 125-250 frames. Forgetting to subsample can cause GPU out-of-memory errors or extremely slow training.
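The effect of subsampling on attention cost is quadratic, which a short calculation makes visible. This sketch assumes 100 input frames per second and integer division for the subsampled length; actual convolutional frontends may differ by a frame or two depending on padding.

```python
def encoder_frames(duration_s, subsample=1, frame_rate_hz=100):
    """Approximate number of encoder positions after subsampling."""
    return int(duration_s * frame_rate_hz) // subsample


for factor in (1, 4, 8):
    frames = encoder_frames(10, factor)
    print(f"{factor}x  {frames:>5} frames  {frames ** 2:>9} attention entries per head")
```

For a 10-second clip, 4x subsampling cuts the attention matrix by a factor of 16 and 8x by a factor of 64.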

CTC on characters with heavy downsampling. If the encoder subsamples 8x, a 1-second clip yields only about 12 encoder frames. CTC emits at most one token per frame, so it cannot produce more than 12 characters from that clip, and fewer if the text contains doubled letters. This forces practitioners to use subword or word-piece output units rather than characters when subsampling is aggressive.

Streaming with large look-ahead. RNN-T and chunk-based attention models that look ahead by N frames to improve accuracy pay a latency penalty of N x 10 ms at a 10 ms frame rate, and proportionally more if N counts subsampled encoder frames. There is a direct trade-off between word error rate and end-of-utterance latency that no architecture fully escapes.

Mismatched utterance lengths at inference. A model trained on utterances up to 20 seconds may behave unpredictably on 90-second segments. Positional encodings (especially absolute sinusoidal ones) can generalise poorly past their training range, causing the encoder to produce degenerate representations for positions it never saw during training.