HuBERT and Discrete Audio Units

Before HuBERT, the core problem with applying BERT-style masked prediction to speech was this: what is the "word" you are predicting? Text has a fixed vocabulary; speech is a continuous signal. Even once the waveform is converted to frames (say, an 80-dimensional mel-spectrogram), adjacent frames are nearly identical and no natural token boundary exists. wav2vec 2.0 (Baevski et al., 2020) solved this by learning a quantiser jointly with a contrastive objective, but the interaction between quantisation and the masked loss made training fragile. HuBERT takes a different route: decouple the two problems entirely. Cluster first, predict later.

The Offline Clustering Trick

The central idea is refreshingly simple. Before pre-training begins, run k-means on MFCC features (39-dimensional, standard acoustic features) over the training corpus to produce a fixed label sequence for every audio file. Each 20 ms frame gets one of K cluster IDs, say K = 100. These are the "hidden units" in the name.
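
As a rough illustration of the labelling step, here is a dependency-free sketch of k-means assigning one cluster ID per frame. The random 39-dimensional vectors merely stand in for real MFCC frames, and the corpus size, K = 10, iteration count, and helper names are all assumptions made for the example. A real pipeline would use an optimised k-means implementation over a large corpus.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to this frame."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(frame, centroids[k]))


def kmeans(frames, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the list of centroids."""
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, num_clusters)]
    for _ in range(num_iters):
        labels = [nearest(f, centroids) for f in frames]
        for k in range(num_clusters):
            members = [f for f, lab in zip(frames, labels) if lab == k]
            if members:  # keep the old centroid if a cluster went empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def label_frames(frames, centroids):
    """Fixed label sequence for one utterance: one cluster ID per frame."""
    return [nearest(f, centroids) for f in frames]


# Synthetic stand-in for MFCC features (assumption: 300 frames, 39 dims).
FEATURE_DIM = 39
rng = random.Random(1)
corpus = [[rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)] for _ in range(300)]

centroids = kmeans(corpus, num_clusters=10)
utterance = corpus[:50]
labels = label_frames(utterance, centroids)
print(labels[:10])
```

The key property is that the centroids are frozen once clustering finishes, so the same audio always maps to the same label sequence during pre-training.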

During pre-training, the model masks roughly 40% of the frames produced by the convolutional feature extractor (in contiguous spans, following wav2vec 2.0 conventions) and tries to predict the cluster label of each masked frame using a cross-entropy loss. The loss is computed only at masked positions, in the spirit of BERT. Unmasked positions provide context but contribute no loss term of their own.
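
One way to picture span masking: every frame independently becomes the start of a masked span with some small probability, and overlapping spans merge. The sketch below follows that scheme. The start probability (0.05) and span length (10) are assumptions picked only so that the expected masked fraction lands near the 40% quoted above; they are not taken from the paper.

```python
import random


def sample_span_mask(num_frames, start_prob, span_length, rng):
    """Boolean mask over frames; True means the frame is hidden.

    Each frame starts a span with probability start_prob, and spans
    may overlap, so the masked runs vary in length.
    """
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_length, num_frames)):
                mask[t] = True
    return mask


def expected_masked_fraction(start_prob, span_length):
    """A frame stays visible only if none of the span_length starts
    that could cover it fires, which has probability (1 - p) ** l."""
    return 1.0 - (1.0 - start_prob) ** span_length


rng = random.Random(0)
mask = sample_span_mask(num_frames=500, start_prob=0.05, span_length=10, rng=rng)
print(f"masked fraction in this draw: {sum(mask) / len(mask):.2f}")
print(f"expected masked fraction:     {expected_masked_fraction(0.05, 10):.2f}")
```

Because spans overlap, the masked fraction is lower than the naive product of start probability and span length (0.05 × 10 = 0.5).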

The encoder is a standard convolutional feature extractor followed by a 12-layer (base) or 24-layer (large) Transformer. The projection head maps Transformer output to a K-class softmax.

raw audio (16 kHz)
     │
[conv feature extractor]   ← 512-dim, 20ms stride
     │
[random masking of ~40% spans]
     │
[12-layer Transformer]
     │
[linear projection → K logits]
     │
cross-entropy vs offline k-means label   (masked frames only)

The elegance is that the cluster labels need not be linguistically meaningful at first. They just need to be consistent enough that predicting them forces the Transformer to learn useful context. Because the loss is masked, the model cannot cheat by memorising local spectral statistics; it must integrate context across the utterance to correctly predict a hidden frame.
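
To make "masked frames only" concrete, here is a minimal sketch of the loss as this section describes it: a K-way softmax per frame, with cross-entropy averaged over masked positions. The toy logits, targets, and function names are assumptions for illustration, and a real implementation would run on batched tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's K logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def masked_cross_entropy(frame_logits, targets, mask):
    """Mean cross-entropy over masked frames; unmasked frames are skipped."""
    losses = [
        -log_softmax(logits)[target]
        for logits, target, is_masked in zip(frame_logits, targets, mask)
        if is_masked
    ]
    if not losses:
        raise ValueError("mask selects no frames")
    return sum(losses) / len(losses)


# Toy example: 4 frames, K = 3 clusters (all values are made up).
frame_logits = [
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [5.0, 0.0, 0.0],
]
targets = [0, 1, 1, 2]              # offline k-means labels
mask = [False, True, True, False]   # only frames 1 and 2 are scored

loss = masked_cross_entropy(frame_logits, targets, mask)
print(f"masked loss: {loss:.4f}")
```

Frame 3 predicts the wrong cluster with high confidence, yet it does not affect the loss because it is unmasked.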

Iterative Refinement of Labels

The offline clustering is iterated. After the first round of pre-training, the Transformer's internal representations (say, layer 6 of the base model) are themselves used as features for a fresh k-means run. This produces higher-quality cluster labels, which are then used to train a second round of the model from scratch (or from the first checkpoint).
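
The control flow of this loop can be sketched as follows. The three callables (`cluster`, `pretrain`, `extract`) are hypothetical placeholders supplied by the caller, not functions from any library, and the string-returning stubs below exist only to show the order in which the stages run.

```python
def iterate_clustering(first_features, cluster, pretrain, extract, num_iterations=2):
    """Alternate offline clustering and masked-prediction pre-training.

    cluster, pretrain and extract are hypothetical stand-ins:
      cluster(features) -> labels   (offline k-means, one unit ID per frame)
      pretrain(labels)  -> model    (here: training from scratch each round)
      extract(model)    -> features (e.g. an intermediate Transformer layer)
    """
    if num_iterations < 1:
        raise ValueError("need at least one iteration")
    features = first_features
    for iteration in range(1, num_iterations + 1):
        labels = cluster(features)
        model = pretrain(labels)
        if iteration < num_iterations:
            # Only needed if another round of clustering follows.
            features = extract(model)
    return model, labels


# Trivial stubs that record the call order instead of doing real work.
log = []


def cluster(features):
    log.append(f"cluster({features})")
    return f"labels<{features}>"


def pretrain(labels):
    log.append(f"pretrain({labels})")
    return f"model<{labels}>"


def extract(model):
    log.append(f"extract({model})")
    return f"hidden<{model}>"


model, labels = iterate_clustering("mfcc", cluster, pretrain, extract)
for step in log:
    print(step)
```

With two iterations, the stubs run in the order cluster, pretrain, extract, cluster, pretrain: the second k-means run sees features from the first model rather than MFCCs.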

In practice, two iterations are sufficient. The first-iteration labels resemble crude phone-like units. After the second iteration, the correspondence to phonemes sharpens dramatically. On phone purity evaluations (where each cluster is mapped to its most common ground-truth phoneme), second-iteration HuBERT large achieves higher purity than first-iteration HuBERT base, and both exceed wav2vec 2.0's quantiser at matched model sizes.
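
Phone purity as defined in the parenthesis above is straightforward to compute from frame-aligned labels. The sketch below is one way to do it, assuming each frame carries both a cluster ID and a ground-truth phone; the eight-frame example and phone symbols are invented for illustration.

```python
from collections import Counter, defaultdict


def phone_purity(cluster_ids, phone_labels):
    """Fraction of frames whose phone matches the most common phone
    of the cluster that frame was assigned to."""
    if len(cluster_ids) != len(phone_labels):
        raise ValueError("sequences must be frame-aligned")
    if not cluster_ids:
        raise ValueError("no frames given")
    phones_per_cluster = defaultdict(Counter)
    for cluster_id, phone in zip(cluster_ids, phone_labels):
        phones_per_cluster[cluster_id][phone] += 1
    # For each cluster, count the frames carrying its majority phone.
    majority_frames = sum(
        counts.most_common(1)[0][1] for counts in phones_per_cluster.values()
    )
    return majority_frames / len(cluster_ids)


# Invented toy alignment: 8 frames, 3 clusters.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
phone_labels = ["AH", "AH", "T", "T", "T", "S", "S", "AH"]

purity = phone_purity(cluster_ids, phone_labels)
print(purity)  # 0.75: each cluster has 2 majority-phone frames, 6 of 8 total
```

Note that purity rises trivially as K grows (one cluster per frame gives purity 1.0), so comparisons are only meaningful at similar numbers of clusters.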

Why does this work? Intuitively, the Transformer has learned to smooth over within-phone variation in its representations, so k-means on those representations finds clusters that respect phone boundaries more reliably than k-means on raw MFCCs.
