Neural Audio Codecs

A 44.1 kHz stereo audio file contains roughly 88,000 floating-point samples per second. A codec language model generates audio token-by-token at perhaps 75 tokens per second. Bridging those two realities is the job of the neural audio codec: compress a waveform into a compact, discrete representation that a language model can treat as vocabulary, then reconstruct perceptually convincing audio from that representation on the other side.

Traditional codecs such as Opus or EVS achieve compression through hand-engineered psychoacoustic models. Neural codecs learn the compression from data, and in doing so produce representations that are far more amenable to downstream generative modelling.

The encoder-quantiser-decoder pipeline

A neural audio codec has three components trained jointly end-to-end.

Encoder. A stack of 1-D convolutional layers (often with increasing dilation) maps a raw waveform to a sequence of dense continuous embeddings at a much lower frame rate. SoundStream (Zeghidour et al., 2021) uses four convolutional blocks with strides of 2, 4, 5, 8, yielding a 320x temporal downsampling; a 24 kHz waveform becomes a 75 Hz embedding sequence.
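The downsampling arithmetic is easy to check. The following is a minimal sketch; the helper name `frame_rate_hz` is illustrative and not taken from SoundStream.

```python
from math import prod

def frame_rate_hz(sample_rate_hz, strides):
    """Frame rate of the embedding sequence after strided downsampling."""
    return sample_rate_hz / prod(strides)

strides = [2, 4, 5, 8]                 # per-block strides quoted above
total_stride = prod(strides)           # overall temporal downsampling factor
rate = frame_rate_hz(24_000, strides)  # embedding frames per second
print(total_stride, rate)              # prints: 320 75.0
```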

Residual Vector Quantiser (RVQ). The dense embedding is quantised through a cascade of N codebooks. The first codebook quantises the embedding; the residual error is passed to the second codebook, which quantises that; and so on. With N = 8 codebooks each of size 1024, the total discrete representation is 8 integers per frame.
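The cascade can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the SoundStream implementation: the codebooks are random rather than learned, the embedding has only 4 dimensions, and the function names (`nearest_index`, `rvq_encode`, `rvq_decode`) are hypothetical.

```python
import random

def nearest_index(codebook, vector):
    """Index of the codebook entry with the smallest squared distance to vector."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )

def rvq_encode(codebooks, embedding):
    """Quantise one frame: each stage codes the residual left by the previous one."""
    residual = list(embedding)
    codes = []
    for codebook in codebooks:
        idx = nearest_index(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct the embedding as the sum of the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy setup (assumed sizes): 8 random codebooks of 1024 entries in 4 dimensions.
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0.0, 1.0 / (stage + 1)) for _ in range(4)] for _ in range(1024)]
    for stage in range(8)
]
embedding = [rng.gauss(0.0, 1.0) for _ in range(4)]
codes = rvq_encode(codebooks, embedding)   # 8 integers for this frame
approx = rvq_decode(codebooks, codes)      # sum of the 8 selected vectors
```

Passing only the first few codes to `rvq_decode` gives a coarser reconstruction from the same codebooks, which is the mechanism behind variable-bitrate operation.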

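The bitrate follows directly: 75 frames/sec x 8 codebooks x log2(1024) bits/code = 6000 bits/sec = 6 kbps. Because each codebook refines the residual left by the ones before it, dropping the last few codebooks at inference lowers the bitrate without retraining. The codebooks are not independent of one another, though: SoundStream trains with quantiser dropout (a randomly chosen number of active codebooks per training example) so that a single model holds up at each of those bitrates.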
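The same arithmetic as a small helper, evaluated for a few values of N. The function name `bitrate_bps` is illustrative.

```python
from math import log2

def bitrate_bps(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second for fixed-length codes: frames x codebooks x bits per code."""
    return frame_rate_hz * n_codebooks * log2(codebook_size)

for n in (2, 4, 8):
    print(n, bitrate_bps(75, n, 1024))   # 1500.0, 3000.0 and 6000.0 bits/sec
```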

Decoder. A mirror-image stack of transposed convolutions reconstructs a waveform from the sum of the quantised vectors. The decoder output is trained against a combination of time-domain reconstruction loss (L1 on the waveform), multi-scale STFT losses, and adversarial losses from multi-period and multi-scale discriminators. The adversarial training is what buys perceptual quality: the discriminators push the decoder to sound natural rather than merely minimise a sample-wise reconstruction error.

EnCodec and the loss balancer

Meta's EnCodec (Defossez et al., 2022) refined SoundStream in two important ways. First, it uses a single multi-scale STFT discriminator, which operates on spectrograms computed with overlapping windows at several resolutions, in place of SoundStream's combination of waveform and STFT discriminators; this reduces audible artifacts. Second, it introduced a loss balancer: rather than summing all losses with hand-tuned weights, the gradient of each loss with respect to the decoder output is normalised by an exponential moving average of its own norm before the gradients are summed. A loss weight then sets that loss's share of the total gradient regardless of the loss's natural scale, which decouples hyperparameter choices from loss magnitude and makes training far less sensitive to architectural changes.
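A simplified sketch of the balancing rule follows. It is not EnCodec's code: the class name `LossBalancer`, the default decay, and the use of plain lists as gradients are assumptions made for illustration. Each gradient is divided by a running average of its own norm and then scaled by its weight's share of the total.

```python
import math

class LossBalancer:
    """Rescale per-loss gradients by an EMA of their norms (simplified sketch)."""

    def __init__(self, weights, decay=0.999, total_norm=1.0, eps=1e-12):
        self.weights = dict(weights)   # loss name -> relative weight
        self.decay = decay             # EMA decay for the gradient norms
        self.total_norm = total_norm   # target norm of the combined gradient
        self.eps = eps
        self.ema_norm = {}

    def combine(self, grads):
        """grads maps loss name -> gradient w.r.t. the decoder output (list of floats)."""
        weight_sum = sum(self.weights[name] for name in grads)
        combined = [0.0] * len(next(iter(grads.values())))
        for name, grad in grads.items():
            norm = math.sqrt(sum(g * g for g in grad))
            prev = self.ema_norm.get(name, norm)   # first call: start at current norm
            self.ema_norm[name] = self.decay * prev + (1 - self.decay) * norm
            share = self.weights[name] / weight_sum
            scale = self.total_norm * share / (self.ema_norm[name] + self.eps)
            combined = [c + scale * g for c, g in zip(combined, grad)]
        return combined

balancer = LossBalancer({"l1": 1.0, "adv": 3.0})
# Two gradients that differ by six orders of magnitude before balancing.
out = balancer.combine({"l1": [1000.0, 0.0], "adv": [0.0, 0.001]})
```

After balancing, the two contributions have norms of about 0.25 and 0.75, matching the 1:3 weights rather than the raw magnitudes.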

EnCodec also showed that a small Transformer language model (5 layers, 8 heads) trained over the RVQ codes can drive an arithmetic coder that losslessly re-encodes them at up to 40% lower bitrate. This is entropy coding over learned statistics rather than a change to the codec itself, so the decoded audio is unchanged.
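The saving comes from the fact that an arithmetic coder spends about -log2(p) bits on a symbol to which the model assigned probability p. The sketch below uses made-up probabilities purely to show the calculation; they are not EnCodec measurements, and `ideal_code_length_bits` is a hypothetical helper.

```python
from math import log2

def ideal_code_length_bits(probs_of_actual_codes):
    """Total bits an ideal entropy coder approaches: sum of -log2 p."""
    return sum(-log2(p) for p in probs_of_actual_codes)

# Hypothetical probabilities a model assigned to four codes that actually occurred.
model_probs = [1 / 32, 1 / 64, 1 / 128, 1 / 64]
fixed_bits = len(model_probs) * log2(1024)        # 10 bits per code without a model
coded_bits = ideal_code_length_bits(model_probs)  # 5 + 6 + 7 + 6 bits
saving = 1 - coded_bits / fixed_bits
```

The better the model predicts the next code, the larger the probabilities and the fewer bits are spent; a model that predicts uniformly over 1024 entries saves nothing.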

Why discrete tokens matter for language models

The connection to speech synthesis becomes clear once you ask: what would it take to generate audio with a language model?

A language model operates over a finite vocabulary. Raw waveform samples are continuous and arrive at roughly 88,000 values per second, which is not tractable to predict one at a time. Mel spectrograms are continuous too. But RVQ codes are integers drawn from a fixed codebook, exactly the kind of discrete tokens a transformer can predict with a softmax.

VALL-E (Wang et al., 2023) exploits this directly. It conditions a transformer on a phoneme sequence and a 3-second speaker prompt, then predicts the EnCodec RVQ codes for the target utterance: an autoregressive model generates the first codebook, and a non-autoregressive model fills in the remaining ones. Because the codec learned perceptually meaningful representations from data, the language model only needs to match those codes; the decoder handles the rest. VALL-E demonstrated zero-shot voice cloning at a naturalness level that prior parametric TTS could not match, simply by treating speech generation as token prediction over codec codes.

The key insight: the quality ceiling of any codec language model is set by the codec's reconstruction fidelity. If the codec introduces artifacts at a given bitrate, the language model cannot remove them, because it can only choose among codes the codec already offers.

Codebook utilisation and the collapse problem

RVQ training is prone to codebook collapse: a large fraction of codebook entries go unused because gradient updates push many embeddings to cluster around a few centroids. A codec with nominal 1024-entry codebooks that actually uses only 200 entries carries at most log2(200), or about 7.6, bits per code rather than the nominal 10, so it has far less representational capacity than it appears.
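Utilisation is straightforward to measure from a stream of emitted codes. The sketch below reports the fraction of entries used and the perplexity (2 to the power of the entropy), which can be read as the effective number of entries in use. The function name `codebook_usage` and the synthetic code stream are assumptions for illustration.

```python
from collections import Counter
from math import log2

def codebook_usage(codes, codebook_size):
    """Return (fraction of entries used, perplexity of the code distribution)."""
    counts = Counter(codes)
    total = len(codes)
    used_fraction = len(counts) / codebook_size
    entropy_bits = -sum((n / total) * log2(n / total) for n in counts.values())
    return used_fraction, 2 ** entropy_bits

# Synthetic stream that only ever touches 200 of 1024 entries, uniformly.
codes = [i % 200 for i in range(2000)]
used, perplexity = codebook_usage(codes, 1024)
print(round(used, 3), round(perplexity, 1))
```

A healthy codebook has perplexity close to its nominal size; a collapsed one has perplexity far below it even if most entries are touched occasionally.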

Mitigations include:

Technique	Mechanism
Exponential moving average (EMA) updates	Updates codebook vectors via momentum rather than backprop; more stable than pure gradient descent
Random codebook restart	Periodically reinitialise dead entries to a randomly sampled encoder output
Commitment loss	Penalises the encoder for drifting away from its nearest centroid: `beta * ||z - sg(q)||^2`
Product quantisation	Splits the embedding into sub-vectors before quantising; each sub-space is smaller and less prone to collapse
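The first two rows of the table can be combined in one update step. The following is a minimal sketch in plain Python, assuming a running count and running sum per entry in the usual EMA formulation. The function names, the decay, the dead-entry threshold, and the choice to reset a restarted entry's statistics to a single sample are all illustrative assumptions, not taken from any particular codec.

```python
import random

def nearest_index(codebook, vector):
    """Index of the codebook entry closest to vector (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )

def ema_update(codebook, cluster_size, embed_sum, batch,
               decay=0.99, dead_threshold=0.01, rng=None):
    """One in-place EMA codebook step with random restart of dead entries."""
    rng = rng or random.Random()
    dim = len(codebook[0])
    counts = [0] * len(codebook)
    sums = [[0.0] * dim for _ in codebook]
    for z in batch:                       # assign each encoder output to an entry
        i = nearest_index(codebook, z)
        counts[i] += 1
        sums[i] = [s + x for s, x in zip(sums[i], z)]
    for i in range(len(codebook)):
        # Running averages of how many vectors hit entry i and of their sum.
        cluster_size[i] = decay * cluster_size[i] + (1 - decay) * counts[i]
        embed_sum[i] = [decay * m + (1 - decay) * s
                        for m, s in zip(embed_sum[i], sums[i])]
        if cluster_size[i] < dead_threshold:
            # Dead entry: reinitialise it to a randomly sampled encoder output.
            codebook[i] = list(rng.choice(batch))
            cluster_size[i] = 1.0
            embed_sum[i] = list(codebook[i])
        else:
            # Live entry: move to the running mean of its assigned vectors.
            codebook[i] = [m / cluster_size[i] for m in embed_sum[i]]

# Tiny example: entry 1 sits far from the data and has almost no usage.
codebook = [[0.0], [100.0]]
cluster_size = [1.0, 0.005]
embed_sum = [[0.0], [0.5]]
ema_update(codebook, cluster_size, embed_sum, batch=[[1.0], [1.0]],
           decay=0.5, rng=random.Random(0))
```

In the example, entry 0 drifts towards the data while entry 1 falls below the threshold and is moved onto a sample from the batch, so it can compete for assignments on the next step.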

The Improved RVQGAN paper (Kumar et al., 2023) found that restarts and EMA updates alone still left codebooks underused. It showed that factorised (low-dimensional) and L2-normalised codebook lookups, combined with a multi-band, multi-scale STFT discriminator, increased codebook utilisation significantly and enabled a single model to compress speech, music, and environmental audio at 44.1 kHz down to 8 kbps with high perceptual fidelity.

When it falls down

Very low bitrates. Below roughly 3 kbps (fewer than N = 4 codebooks of size 1024 at 75 Hz), RVQ cannot represent fine spectral detail. Fricatives and sibilants become blurred; music loses pitch accuracy. Neural codecs degrade more gracefully than traditional codecs, in that they avoid audible blocking artifacts, but the perceptual floor is still real.

Out-of-distribution audio. A codec trained on speech and music may quantise novel timbres (unusual instruments, non-speech vocalisations) poorly. The nearest-neighbour lookup finds the closest training-distribution centroid, which may not be close at all. Reconstruction sounds like a garbled version of whatever the codec has seen most.

Long contexts for language models. Even at 75 frames/sec, a 30-second clip produces 2,250 tokens per codebook, or 18,000 tokens in total with N = 8 codebooks. Autoregressive generation over that span is expensive. Hierarchical schemes (coarse codebooks first, fine codebooks conditioned on coarse) and non-autoregressive decoders are active research directions.
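The sequence lengths involved follow from the frame rate and the number of codebooks. A small sketch, with the illustrative helper `token_count` and defaults taken from the figures used in this article:

```python
def token_count(duration_s, frame_rate_hz=75, n_codebooks=8):
    """Return (tokens per codebook, total tokens if all codebooks are flattened)."""
    frames = int(duration_s * frame_rate_hz)
    return frames, frames * n_codebooks

per_codebook, flattened = token_count(30)
print(per_codebook, flattened)   # prints: 2250 18000
```

Predicting only the first codebook autoregressively keeps the sequence at the per-codebook length, which is one reason coarse-first schemes are attractive.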

Phase reconstruction. The encoder sees the raw waveform, but the training loss is dominated by spectral-magnitude terms, so phase is only weakly constrained. Phase relationships can therefore be imprecise, causing comb-filtering artifacts on sustained tones and slight hollowness on singing voice.

Mismatch between codec training and downstream use. A language model trained on codec tokens inherits any systematic biases in the codec. If the codec was trained on one sampling rate and the downstream system uses another, quality degrades substantially. Resampling before encoding is required but not always documented in pipelines.

Further reading