Audio Features and Spectrograms
Raw audio waveforms are rarely fed directly to speech models; this concept explains how and why they are first converted into spectrogram-based representations that compress perceptual information into a learnable 2-D grid.
beginner · 8 min read
A microphone feeding a typical speech pipeline records air pressure 16,000 times per second. A 10-second sentence therefore arrives as 160,000 raw floating-point numbers. Feed those directly to a transformer and the sequence is so long that the quadratic cost of attention becomes crippling, and the model must discover for itself which microscopic pressure differences are linguistically meaningful. Most major speech recognition systems, from HTK in the 1990s to Whisper in 2022, avoid both problems by converting the waveform into a compact time-frequency image before any learned layer sees it.
From Waveform to Spectrum: The Short-Time Fourier Transform
A full Fourier transform would tell you which frequencies exist across the entire recording but not when they occurred. Speech is non-stationary: phonemes change on timescales of 20-100 ms. The Short-Time Fourier Transform (STFT) handles this by slicing the signal into overlapping windows and running a DFT on each slice.
Concretely, given a waveform x[n] and a window function w of length N:
X[m, k] = sum_{n=0}^{N-1} x[m*H + n] * w[n] * exp(-j*2*pi*k*n/N)
where m is the frame index, H is the hop length (stride between frames), and k indexes frequency bins. The magnitude squared, |X[m,k]|^2, is the power spectrogram: a 2-D array with time on one axis and linear frequency on the other.
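The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `power_spectrogram` and its defaults are assumptions chosen for this example, the edges of the signal are not padded, and each windowed frame is zero-padded up to the FFT size, so the DFT length is 512 rather than the window length.

```python
import numpy as np

def power_spectrogram(x, win_length=400, hop_length=160, n_fft=512):
    """Return |X[m, k]|^2 with shape (num_frames, n_fft // 2 + 1)."""
    # Symmetric Hann window; many libraries default to the periodic variant instead.
    window = np.hanning(win_length)
    # Only frames that fit entirely inside the signal (no edge padding).
    num_frames = 1 + (len(x) - win_length) // hop_length
    frames = np.stack([
        x[m * hop_length : m * hop_length + win_length] * window
        for m in range(num_frames)
    ])
    # rfft zero-pads each frame to n_fft and keeps the non-negative frequencies.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.abs(spectrum) ** 2

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate      # one second of audio
x = np.sin(2 * np.pi * 1000.0 * t)            # 1 kHz test tone
power = power_spectrogram(x)

# The strongest bin, averaged over time, should sit at the tone's frequency.
peak_bin = int(power.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 512
```

Production code would normally call a library routine, which also handles padding and batching.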
Typical choices for 16 kHz speech:
| Parameter | Common value | Effect |
|---|---|---|
| Window length | 25 ms (400 samples) | Frequency resolution of roughly 40 Hz (1 / 25 ms) |
| Hop length | 10 ms (160 samples) | 100 frames per second |
| Window type | Hann | Reduces spectral leakage at frame boundaries |
| FFT size | 512 | Gives 257 unique frequency bins (0-8 kHz) |
A 10-second clip at these settings produces a 1000 x 257 matrix, which is already 160x shorter in the time dimension than the raw waveform.
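The shape arithmetic can be checked directly. The sketch below assumes the signal is padded at the edges so that every hop yields one frame; without padding the frame count is a few frames lower.

```python
sample_rate = 16_000
duration_s = 10
hop_length = 160    # 10 ms at 16 kHz
n_fft = 512

num_samples = sample_rate * duration_s        # raw waveform length
num_frames = num_samples // hop_length        # one frame per hop (edge padding assumed)
num_bins = n_fft // 2 + 1                     # unique bins of a real-valued FFT
time_reduction = num_samples // num_frames    # shortening along the time axis
```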
The Mel Scale: Matching Human Perception
Linear frequency bins are wasteful for speech. Human hearing distinguishes pitches finely at low frequencies (100 Hz vs. 200 Hz is a big perceptual jump) but coarsely at high frequencies (3000 Hz vs. 3100 Hz sounds nearly identical). The mel scale, proposed by Stevens, Volkmann, and Newman in 1937, maps linear frequency f to a perceptual pitch scale. A widely used analytic approximation is:
mel(f) = 2595 * log10(1 + f / 700)
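As a quick check of that intuition, the mapping and its inverse can be written out and applied to the two 100 Hz gaps mentioned above. The function names are illustrative; the inverse is obtained by solving the formula for f.

```python
import math

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz, using the formula above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The same 100 Hz step covers far more of the mel axis at low frequencies.
low_gap = hz_to_mel(200) - hz_to_mel(100)
high_gap = hz_to_mel(3100) - hz_to_mel(3000)
```

Here `low_gap` comes out several times larger than `high_gap`, which is the compression the filter bank exploits.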
In practice, a bank of triangular filters is placed at mel-spaced centre frequencies. Each filter integrates the power spectrum over its bandwidth, collapsing 257 linear bins down to typically 80 or 128 mel bins. The result is the mel spectrogram: a (T x 80) matrix in which each row summarises one analysis frame, with consecutive rows 10 ms apart, as frequency energy distributed perceptually.
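One way to build such a filter bank is sketched below. It assumes the filters span 0 Hz to the Nyquist frequency, peak at 1.0, and are not area-normalised. Libraries differ on all three points and on the exact mel formula, so the values will not match any particular toolkit.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=512, sample_rate=16_000):
    """Return an (n_mels, n_fft // 2 + 1) matrix of triangular filters."""
    # Centre frequency of every FFT bin, 0 Hz up to Nyquist.
    bin_hz = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    # n_mels + 2 edge points, equally spaced on the mel axis.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
    fb = np.zeros((n_mels, bin_hz.size))
    for i in range(n_mels):
        left, centre, right = edges_hz[i : i + 3]
        rising = (bin_hz - left) / (centre - left)
        falling = (right - bin_hz) / (right - centre)
        # Triangle = lower of the two slopes, clipped at zero outside the band.
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

fb = mel_filterbank()
power = np.random.default_rng(0).random((1000, 257))   # stand-in for a power spectrogram
mel_spec = power @ fb.T                                 # shape (1000, 80)
```

Applying the filter bank is a single matrix product, which is why the whole front-end is cheap compared with the model that follows it.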
Taking the log of mel filter-bank energies is standard practice because:
- Perceived loudness scales logarithmically (decibels).
- Log compression reduces the dynamic range, which benefits gradient-based training.
- It converts multiplicative distortion (e.g., a channel's frequency response, which scales the spectrum) into an additive offset, which simple normalisation such as mean subtraction can remove.
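The last point is easy to verify numerically. In the sketch below, the floor value and the gain are arbitrary choices for illustration, and natural log is used; toolkits vary between natural log, log10, and decibels.

```python
import numpy as np

def log_mel(mel_power, floor=1e-10):
    """Log compression with a small floor to avoid log(0)."""
    return np.log(np.maximum(mel_power, floor))

rng = np.random.default_rng(0)
mel_power = rng.random((100, 80)) + 0.1   # stand-in for mel filter-bank energies
gain = 4.0                                # hypothetical constant gain on the power

# After the log, the multiplicative gain is the same additive shift everywhere.
offset = log_mel(gain * mel_power) - log_mel(mel_power)
is_constant_shift = np.allclose(offset, np.log(gain))
```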
Whisper, for instance, uses 80 log-mel bins computed from a 25 ms Hann window at 10 ms hops over audio resampled to 16 kHz. The feature extractor is entirely deterministic and non-learned.
MFCCs: Compressing Further with the DCT
Before neural networks dominated speech processing, statistical models (GMM-HMMs) required features with low dimensionality and approximately uncorrelated dimensions. Mel-Frequency Cepstral Coefficients (MFCCs) achieve this by applying a Discrete Cosine Transform (DCT) to the log mel filter-bank energies:
c[n] = sum_{m=1}^{M} log(S[m]) * cos(pi*n*(m - 0.5)/M), n = 0, 1, ..., K-1
Here S[m] are the mel filter-bank outputs and K (typically 13) is the number of cepstral coefficients to retain. The DCT decorrelates the filter-bank energies and concentrates most speech information in the first few coefficients.
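The DCT above translates directly into a matrix product. This sketch uses the unnormalised form exactly as written in the formula; library implementations usually add an orthonormal scaling, so absolute values will differ. The function name and the random input are placeholders.

```python
import numpy as np

def mfcc_from_log_mel(log_mel, num_ceps=13):
    """Apply the unnormalised DCT-II above to each row of a (T, M) log mel matrix."""
    M = log_mel.shape[1]
    n = np.arange(num_ceps)[:, None]            # cepstral index 0 .. K-1
    m = np.arange(1, M + 1)[None, :]            # filter index 1 .. M
    basis = np.cos(np.pi * n * (m - 0.5) / M)   # shape (K, M)
    return log_mel @ basis.T                    # shape (T, K)

log_mel = np.random.default_rng(0).standard_normal((1000, 80))   # placeholder features
mfcc = mfcc_from_log_mel(log_mel)
```

With this definition c[0] is simply the sum of the log energies in a frame, which is why it is often treated as an overall energy term.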
In practice, MFCCs are augmented with their first and second time derivatives (deltas and delta-deltas), giving 39 features per frame. This pipeline dominated the field for roughly two decades and remains a useful baseline.
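A simple way to see where the 39 comes from is to stack the coefficients with two rounds of time differencing. The sketch below uses plain central differences via `np.gradient`, which is an assumption made for brevity; classic toolkits typically compute deltas with a regression over a window of several frames.

```python
import numpy as np

def add_deltas(mfcc):
    """Stack static, delta, and delta-delta features for a (T, 13) matrix."""
    delta = np.gradient(mfcc, axis=0)      # first time derivative
    delta2 = np.gradient(delta, axis=0)    # second time derivative
    return np.concatenate([mfcc, delta, delta2], axis=1)

mfcc = np.random.default_rng(0).standard_normal((1000, 13))   # placeholder MFCCs
features = add_deltas(mfcc)                                   # shape (1000, 39)
```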
For modern deep-learning ASR, log mel filter banks are generally preferred over MFCCs because:
- Neural networks can learn their own decorrelation implicitly.
- Discarding higher DCT coefficients throws away information that deep models could exploit.
- Mel filter banks are easier to implement correctly and to augment (see SpecAugment below).
SpecAugment: Treating the Spectrogram as an Image
Once audio is in the form of a 2-D grid, standard image augmentation techniques become applicable. SpecAugment (Park et al., 2019) applies three transforms directly to the log mel spectrogram during training:
- Time warping - randomly stretches or compresses a segment along the time axis.
- Frequency masking - zeroes out up to F consecutive mel bins, with the width and position chosen uniformly at random.
- Time masking - zeroes out up to T consecutive time steps, with the width and position chosen uniformly at random.
These augmentations are applied after the feature extraction step and before the encoder, meaning they require no changes to the acoustic model architecture. They proved remarkably effective: on LibriSpeech test-other, SpecAugment reduced WER from roughly 12% to 6.8% on a Listen-Attend-Spell model without a language model.
The intuition is that speech is locally redundant: you can reconstruct a masked phoneme from context. Training the model to do so forces it to rely on broader acoustic and linguistic patterns rather than memorising narrow frequency cues.
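The two masking transforms can be sketched as follows. This is an illustration, not the authors' code: it applies one mask of each kind, omits time warping, and the default maximum widths are placeholder values rather than a recommended policy.

```python
import numpy as np

def mask_spectrogram(log_mel, max_freq_width=27, max_time_width=100, rng=None):
    """Return a copy of a (T, n_mels) array with one frequency and one time mask."""
    rng = np.random.default_rng() if rng is None else rng
    out = log_mel.copy()
    num_frames, num_mels = out.shape

    # Frequency mask: width f in [0, F], then a start bin so the mask fits.
    f = int(rng.integers(0, min(max_freq_width, num_mels) + 1))
    f0 = int(rng.integers(0, num_mels - f + 1))
    out[:, f0 : f0 + f] = 0.0

    # Time mask: width t in [0, T], then a start frame so the mask fits.
    t = int(rng.integers(0, min(max_time_width, num_frames) + 1))
    t0 = int(rng.integers(0, num_frames - t + 1))
    out[t0 : t0 + t, :] = 0.0
    return out

log_mel = np.ones((500, 80))   # placeholder features
augmented = mask_spectrogram(log_mel, rng=np.random.default_rng(0))
```

Zero is a sensible fill value only if the features have been normalised to roughly zero mean; otherwise filling with the utterance mean is a common alternative.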
When It Falls Down
Reverberation and noise. Models trained on log mel features have usually seen relatively clean, close-microphone speech. In reverberant rooms, energy from one frame smears into the frames that follow: the room's impulse response is far longer than a 25 ms window, so its effect can no longer be treated as a fixed multiplicative distortion of each frame's spectrum. Feature normalisation (cepstral mean and variance normalisation, or per-utterance mean subtraction) helps but does not fully solve this; the fundamental mismatch between training (studio) and test (far-field) conditions remains a persistent failure mode.
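Per-utterance mean and variance normalisation is short enough to show in full. The sketch below also illustrates what it can and cannot do: a fixed offset in the log domain (a hypothetical stand-in for a short, time-invariant channel) is removed exactly, whereas frame-to-frame smearing from reverberation is not a fixed offset and would survive.

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Per-utterance mean and variance normalisation of a (T, D) matrix."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((1000, 80))        # placeholder log mel features
channel_offset = rng.standard_normal((1, 80))    # hypothetical fixed channel colouration

# A constant per-bin offset disappears after normalisation.
same = np.allclose(cmvn(log_mel), cmvn(log_mel + channel_offset))
```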
Very short or very long windows. A 25 ms window is a compromise. Shorter windows improve temporal resolution but degrade frequency resolution (Heisenberg-type uncertainty: delta_t * delta_f >= 1/(4*pi)). Longer windows miss rapid phonemic transitions. For tonal languages where pitch carries meaning, dedicated pitch features are sometimes appended to the standard filter banks.
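The trade-off is easy to tabulate with the rough rule that frequency resolution is about one over the window duration. That rule ignores the extra main-lobe widening a Hann window introduces, so the figures are order-of-magnitude guides only; the function name is illustrative.

```python
def window_tradeoff(window_ms, sample_rate=16_000):
    """Return (window length in samples, approximate frequency resolution in Hz)."""
    window_samples = sample_rate * window_ms // 1000
    freq_resolution_hz = 1000.0 / window_ms   # rough 1 / duration rule of thumb
    return window_samples, freq_resolution_hz

for window_ms in (5, 25, 100):
    samples, resolution = window_tradeoff(window_ms)
    print(f"{window_ms} ms window: {samples} samples, ~{resolution:.0f} Hz resolution")
```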
Normalisation sensitivity. Log mel features can vary by 20-30 dB between a quiet whisper and a loud shout. Models trained on normalised data can fail catastrophically on un-normalised input. Whisper sidesteps this partly by using a fixed log-mel computation that limits the dynamic range relative to each input's peak and rescales the result (it also pads or trims every input to 30 seconds), but amplitude normalisation still matters in deployment.
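Sampling rate mismatch. Features computed at 16 kHz and features computed at 8 kHz (telephone-band speech) are not interchangeable. If 8 kHz audio is passed through a 16 kHz front-end without resampling, every filter lands on the wrong frequencies; if it is upsampled first, the mel bins above 4 kHz are nearly empty because the original signal carries no content there. Either way, a model trained on 16 kHz audio sees inputs unlike anything in its training data. This is a common silent failure in production pipelines.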
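A rough count shows how much of the feature vector is affected when telephone-band audio is upsampled. The sketch assumes an 80-filter bank spanning 0 to 8 kHz with centres equally spaced on the mel axis, using the mel formula given earlier; it counts the filters whose centre lies above 4 kHz, the Nyquist frequency of 8 kHz audio.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

n_mels = 80
top_mel = hz_to_mel(8000.0)          # upper edge of a 16 kHz front-end
step = top_mel / (n_mels + 1)        # spacing between filter centres in mels
centres_mel = [step * i for i in range(1, n_mels + 1)]

cutoff_mel = hz_to_mel(4000.0)       # nothing above this exists in 8 kHz audio
empty_bins = sum(c > cutoff_mel for c in centres_mel)
fraction_empty = empty_bins / n_mels
```

Under these assumptions roughly a quarter of the mel bins carry almost no energy, which a model trained on wideband speech has never seen.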
Discrete artefacts at frame boundaries. A hop that is not a whole number of samples (for example, 10 ms at 22.05 kHz is 220.5 samples) has to be rounded, so frame boundaries drift relative to their nominal timestamps over time, producing subtle periodic artefacts that occasionally confuse models trained on perfectly aligned features.
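A cheap guard is to check, before extracting features, whether the requested hop is a whole number of samples at the actual sample rate. The helper below is a hypothetical utility written for this example.

```python
def hop_in_samples(hop_ms, sample_rate):
    """Return (exact hop in samples, whether it is a whole number)."""
    exact = hop_ms * sample_rate / 1000
    return exact, exact.is_integer()

ok_16k = hop_in_samples(10, 16_000)     # 160.0 samples, whole
bad_22k = hop_in_samples(10, 22_050)    # 220.5 samples, must be rounded
```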
Further Reading
- SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition - Park et al., 2019; introduces frequency and time masking on log mel spectrograms.
- Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) - Radford et al., 2022; describes Whisper's fixed 80-bin log mel front-end.
- torchaudio Transforms Documentation - reference implementation of MelSpectrogram, MFCC, Spectrogram, and SpecAugment in PyTorch.