HuBERT and Discrete Audio Units
HuBERT pre-trains a speech encoder by predicting offline k-means cluster labels for masked audio frames, producing discrete unit sequences that rival phoneme transcriptions without any text supervision.
Before HuBERT, the core problem with applying BERT-style masked prediction to speech was this: what is the "word" you are predicting? Text has a fixed vocabulary; audio is a continuous signal, typically represented as a raw waveform or an 80-dimensional mel-spectrogram, where adjacent frames are nearly identical and no natural token boundary exists. wav2vec 2.0 (Baevski et al., 2020) addressed this by learning a quantiser jointly with a contrastive objective, but the coupling between quantisation and the masked loss made training fragile. HuBERT takes a different route: decouple the two problems entirely. Cluster first, predict later.
The Offline Clustering Trick
The central idea is refreshingly simple. Before pre-training begins, run k-means on MFCC features (39-dimensional: 13 coefficients plus their first and second derivatives) over the training corpus, then assign every frame to its nearest centroid to produce a fixed label sequence for every audio file. Each 20 ms frame gets one of K cluster IDs, say K = 100. These are the "hidden units" in the name.
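The clustering step can be sketched in a few lines. This is a minimal illustration, not the HuBERT codebase: random 39-dimensional vectors stand in for real MFCC frames, the utterance names and the helper functions `assign` and `kmeans_fit` are hypothetical, and a plain Lloyd's k-means replaces whatever clustering implementation a production pipeline would use (scikit-learn's `MiniBatchKMeans` is a common choice at corpus scale).

```python
import numpy as np

K = 100          # number of clusters, matching the example in the text
FEAT_DIM = 39    # MFCC dimensionality from the text


def assign(features, centroids):
    """Return the index of the nearest centroid for each frame."""
    # Squared Euclidean distance via ||x||^2 - 2 x.c + ||c||^2, shape (N, K)
    dists = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return dists.argmin(axis=1)


def kmeans_fit(features, k, n_iters=20, seed=0):
    """Plain Lloyd's k-means; returns the (k, dim) centroid matrix."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames
    init = rng.choice(len(features), size=k, replace=False)
    centroids = features[init].copy()
    for _ in range(n_iters):
        labels = assign(features, centroids)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids


# Assumption: synthetic frames stand in for per-utterance MFCC matrices (T, 39)
rng = np.random.default_rng(0)
corpus = {
    f"utt{i}": rng.normal(size=(int(rng.integers(50, 150)), FEAT_DIM))
    for i in range(10)
}

# 1. Fit k-means once, offline, on all frames pooled across the corpus
all_frames = np.concatenate(list(corpus.values()), axis=0)
centroids = kmeans_fit(all_frames, K)

# 2. Freeze the centroids and label every frame of every utterance
units = {utt: assign(feats, centroids) for utt, feats in corpus.items()}

print(units["utt0"][:10])  # first ten cluster IDs of one utterance
```

The point of the sketch is the ordering: the centroids are fitted and frozen before any encoder training happens, so `units` is a fixed set of targets that the masked-prediction stage can later read from disk.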