HiFi-GAN and Neural Vocoders

In 1952, a machine that generated speech required a room full of analog circuitry and produced muffled, robotic output. In 2020, HiFi-GAN running on a single GPU synthesised near-human-quality speech 167.9 times faster than real time. The gap was not closed by better hardware alone; it was closed by rethinking what a vocoder actually needs to learn.

What a vocoder does and why it is hard

In a modern text-to-speech (TTS) pipeline, the front end handles text normalisation and prosody prediction, an acoustic model (Tacotron 2, FastSpeech 2, etc.) produces a mel-spectrogram, and the vocoder converts that spectrogram back into a time-domain waveform. The vocoder is the last mile, and it is disproportionately responsible for perceived audio quality.

The challenge is a reconstruction problem with a lossy intermediate. A mel-spectrogram discards phase information. Two waveforms with identical magnitude spectra but different phases can sound noticeably different, so the magnitudes alone do not pin down the waveform. A vocoder must therefore hallucinate a plausible phase trajectory consistent with the magnitude envelope, at 22,050 samples per second or higher.
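The phase ambiguity can be illustrated with a toy signal. The sketch below is only an illustration: it uses a single naive DFT over a short two-tone signal as a stand-in for one spectrogram frame, and the tone frequencies and signal length are arbitrary choices, not values from any TTS system.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitude of every DFT bin of a real signal (naive O(n^2) transform)."""
    n = len(signal)
    mags = []
    for k in range(n):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        mags.append(abs(acc))
    return mags

def two_tone(n, phase_offset):
    """Sum of two sinusoids (3 and 7 cycles per frame); only the second tone's phase varies."""
    return [
        math.sin(2 * math.pi * 3 * t / n)
        + math.sin(2 * math.pi * 7 * t / n + phase_offset)
        for t in range(n)
    ]

N = 64  # arbitrary frame length for the demo
wave_a = two_tone(N, 0.0)
wave_b = two_tone(N, math.pi / 2)

# Largest difference between the two magnitude spectra (numerically zero).
max_mag_diff = max(abs(a - b) for a, b in
                   zip(dft_magnitudes(wave_a), dft_magnitudes(wave_b)))
# Largest sample-wise difference between the two waveforms (clearly non-zero).
max_wave_diff = max(abs(a - b) for a, b in zip(wave_a, wave_b))

print(f"magnitude difference: {max_mag_diff:.2e}")
print(f"waveform difference:  {max_wave_diff:.3f}")
```

The two signals share a magnitude spectrum yet differ sample by sample, which is the information a vocoder has to reinvent.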

Classical vocoders (WORLD, STRAIGHT) solved this analytically: decompose speech into a source-filter model with a voiced/unvoiced flag, a pitch estimate, and a spectral envelope, then resynthesise. This is fast, but the vocoded speech sounds processed because the model is too simple for the residual complexity of real speech.

The autoregressive detour: WaveNet and its cost

WaveNet (van den Oord et al., 2016) demonstrated that a deep dilated causal convolutional network, trained with conditional log-likelihood on raw waveform samples, could produce strikingly natural speech. The quality improvement was immediate and large. The speed problem was severe: the autoregressive dependency means sample t cannot be computed until sample t-1 is known, so synthesis is sequential and slow.

WaveNet's receptive field covers thousands of samples, but the generation loop runs one sample at a time at 16-24 kHz. For a 5-second utterance at 22.05 kHz that is roughly 110,000 sequential forward passes. Parallel WaveNet (2018) and WaveGlow (2018) addressed this with distillation-based and flow-based approaches respectively, buying parallelism at the cost of architectural complexity. They worked, but they were expensive to train and required careful hyperparameter management.
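The arithmetic behind that count is worth making explicit. The sketch below compares the number of sequential steps a sample-by-sample model needs with the number of spectrogram frames covering the same audio; the hop length of 256 is an assumed, typical value and the function names are made up for this example.

```python
def autoregressive_steps(duration_s, sample_rate):
    """Sequential forward passes when exactly one sample is generated per pass."""
    return int(duration_s * sample_rate)

def mel_frames(duration_s, sample_rate, hop_length):
    """Number of spectrogram frames covering the same audio (rounded up)."""
    samples = int(duration_s * sample_rate)
    return -(-samples // hop_length)  # ceiling division

steps = autoregressive_steps(5.0, 22050)
frames = mel_frames(5.0, 22050, 256)  # hop length 256 is an assumption

print(f"sample-level sequential steps: {steps}")
print(f"spectrogram frames:            {frames}")
```

A parallel vocoder still has to emit every sample, but it can compute them in a fixed number of layer evaluations rather than one dependent step per sample.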

The community needed something both fast and simple.

HiFi-GAN: the generator and its multi-receptive-field fusion

HiFi-GAN (Kong, Kim, and Bae, NeurIPS 2020) frames vocoding as a GAN problem. The generator is a fully convolutional network that upsamples a mel-spectrogram in stages until the temporal resolution matches the target sample rate.

The key architectural insight is the Multi-Receptive Field Fusion (MRF) module. At each upsampling stage, several residual blocks with different kernel sizes and dilation rates run in parallel, and their outputs are summed:

Input feature map (at current resolution)
  ├── ResBlock(kernel=3, dilations=[1,3,5])
  ├── ResBlock(kernel=7, dilations=[1,3,5])
  └── ResBlock(kernel=11, dilations=[1,3,5])
          ↓
       Element-wise sum → next upsampling stage

This means the generator simultaneously models short-range (formant transitions) and long-range (pitch periodicity, ~200 samples at 22 kHz) structure without having to choose a single receptive field size. The upsampling factors are chosen so their product equals the mel hop length (typically 256), ensuring temporal alignment between the spectrogram and the output waveform.
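A quick calculation makes the two claims above concrete. This is a rough sketch, not the reference implementation: it counts only the dilated convolutions listed in the diagram, and the upsampling factors [8, 8, 2, 2] are one assumed factorisation of a 256-sample hop.

```python
from math import prod

def dilated_stack_receptive_field(kernel_size, dilations):
    """Receptive field, in steps at the current resolution, of stacked dilated convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def output_samples(num_frames, upsample_rates):
    """Waveform length after num_frames pass through every upsampling stage."""
    return num_frames * prod(upsample_rates)

DILATIONS = [1, 3, 5]
fields = {k: dilated_stack_receptive_field(k, DILATIONS) for k in (3, 7, 11)}

UPSAMPLE_RATES = [8, 8, 2, 2]  # assumed example factorisation
HOP_LENGTH = 256               # mel hop length from the text

for kernel, field in fields.items():
    print(f"kernel {kernel:2d}: receptive field {field} steps")
print(f"upsampling product: {prod(UPSAMPLE_RATES)}")
print(f"100 frames -> {output_samples(100, UPSAMPLE_RATES)} samples")
```

Note that the receptive field is measured in steps at the stage's own resolution. Under the assumed factors, one step after the first upsampling layer corresponds to 256 / 8 = 32 output samples, so even a modest field at an early stage spans several pitch periods in the final waveform.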

The discriminators: why two families

A GAN is only as good as its discriminator's signal. HiFi-GAN uses two families:

Multi-Scale Discriminator (MSD): Three discriminators operate on the waveform at different temporal resolutions (original, 2x downsampled, 4x downsampled). Each uses strided convolutions with a mix of kernel sizes. Operating at multiple scales forces the generator to produce consistent structure from fine sample-level detail up to smoother, longer-range patterns.

Multi-Period Discriminator (MPD): Five discriminators each reshape the 1D waveform into a 2D matrix by folding at a fixed period (2, 3, 5, 7, 11 samples). They then apply 2D convolutions. The motivation: speech is a superposition of quasi-periodic signals (pitch harmonics, formants). Folding at period p explicitly aligns all samples that are p apart, so the discriminator can learn whether harmonic relationships are correct.
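The two input transformations can be sketched in a few lines. This is a simplified illustration under stated assumptions: the pooling uses non-overlapping windows and the folding pads the tail with zeros, both chosen for clarity rather than to match any particular implementation, and the function names are invented for this example.

```python
def average_pool(waveform, factor):
    """Downsample by averaging non-overlapping windows of `factor` samples."""
    usable = len(waveform) - len(waveform) % factor
    return [
        sum(waveform[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]

def fold_by_period(waveform, period):
    """Reshape a 1D waveform into rows of length `period`, zero-padding the tail."""
    padding = (-len(waveform)) % period
    padded = list(waveform) + [0.0] * padding
    return [padded[i:i + period] for i in range(0, len(padded), period)]

wave = [float(i) for i in range(10)]  # toy "waveform": sample value equals its index

# MSD: the same signal at three temporal resolutions.
msd_views = [wave, average_pool(wave, 2), average_pool(wave, 4)]

# MPD: one 2D view per period.
mpd_views = {p: fold_by_period(wave, p) for p in (2, 3, 5, 7, 11)}

for row in mpd_views[3]:
    print(row)
```

In the period-3 view, reading down any column visits samples that are exactly three apart in the original signal, which is what lets a 2D convolution along that axis compare them directly.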

Discriminator	What it sees	What failure it catches
MSD (scale 1)	full-resolution waveform	fine-grained spectral inconsistency
MSD (scale 2-3)	downsampled signals	coarse temporal structure
MPD (period 2-11)	2D periodic folding	phase incoherence, harmonic distortion

The training objective combines adversarial loss (least-squares GAN), feature matching loss (L1 distance on intermediate discriminator activations), and mel-spectrogram reconstruction loss. The mel loss is crucial for stability early in training when the adversarial signal is uninformative.
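The three terms can be written down compactly. The sketch below works on plain lists of floats for a single discriminator; a real system would sum these terms over every sub-discriminator and operate on tensors. The weights 2 and 45 follow the values reported in the HiFi-GAN paper, but treat them, and all function names, as assumptions of this example.

```python
def mean(values):
    return sum(values) / len(values)

def discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN: push real scores toward 1 and generated scores toward 0."""
    return (mean([(r - 1.0) ** 2 for r in real_scores])
            + mean([f ** 2 for f in fake_scores]))

def generator_adversarial_loss(fake_scores):
    """Least-squares GAN: the generator wants its scores to reach 1."""
    return mean([(f - 1.0) ** 2 for f in fake_scores])

def l1_distance(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return mean([abs(x - y) for x, y in zip(a, b)])

def feature_matching_loss(real_features, fake_features):
    """Sum of per-layer L1 distances between discriminator activations."""
    return sum(l1_distance(r, f) for r, f in zip(real_features, fake_features))

def generator_loss(fake_scores, real_features, fake_features,
                   real_mel, fake_mel, lambda_fm=2.0, lambda_mel=45.0):
    """Weighted sum of adversarial, feature matching, and mel terms."""
    return (generator_adversarial_loss(fake_scores)
            + lambda_fm * feature_matching_loss(real_features, fake_features)
            + lambda_mel * l1_distance(real_mel, fake_mel))

# Toy values: two scores, two "layers" of activations, two mel bins.
d_loss = discriminator_loss([1.0, 0.5], [0.5, 0.5])
g_loss = generator_loss(
    fake_scores=[0.5, 0.5],
    real_features=[[1.0, 2.0], [0.0]],
    fake_features=[[1.5, 2.0], [1.0]],
    real_mel=[0.0, 1.0],
    fake_mel=[0.5, 1.0],
)
print(d_loss, g_loss)
```

With a mel weight this large, the reconstruction term dominates the generator loss whenever the spectrogram error is non-trivial, which fits the observation that it steers training before the adversarial signal becomes useful.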

Speed-quality variants

HiFi-GAN ships in three configurations that trade synthesis quality for speed and footprint:

Model	Params	Speed vs real time (V100 GPU)	Speed vs real time (CPU)	MOS
V1	13.92M	167.9x	1.43x	4.36
V2	0.92M	764.8x	9.74x	4.23
V3	1.46M	1,186.8x	13.44x	4.05

V1 is the quality target; V3 is deployable on CPU in real-time applications. The MOS gap between V1 and ground truth was small enough that the paper could claim near-human quality on single-speaker data.
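To get a feel for what these speed-ups mean in practice, the sketch below converts a "times faster than real time" figure into wall-clock latency and sample throughput. It assumes the 22.05 kHz output rate used earlier in this article; the helper names are invented for this example.

```python
def synthesis_time_s(audio_duration_s, realtime_factor):
    """Wall-clock seconds needed to synthesise audio at a given speed-up."""
    return audio_duration_s / realtime_factor

def samples_per_second(sample_rate, realtime_factor):
    """Generated samples per second of wall-clock time."""
    return sample_rate * realtime_factor

gpu_time = synthesis_time_s(5.0, 167.9)  # V1 on a V100 GPU
cpu_time = synthesis_time_s(5.0, 13.4)   # V3 on CPU

print(f"5 s of audio, V1 on GPU: {gpu_time * 1000:.1f} ms")
print(f"5 s of audio, V3 on CPU: {cpu_time * 1000:.1f} ms")
print(f"V1 GPU throughput: {samples_per_second(22050, 167.9) / 1e6:.2f} M samples/s")
```

Both figures leave ample headroom for the acoustic model and text front end to share the latency budget of an interactive system.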

When it falls down

Out-of-distribution spectrograms. The generator is conditioned on mel-spectrograms and learns to invert them precisely. When the upstream acoustic model produces spectrograms that differ from the training distribution (different pitch range, unusually long silences, very fast speech), artefacts appear as buzzing, metallic resonance, or interrupted periodicity. Fine-tuning the vocoder on matched acoustic-model output is standard practice.

High-frequency content. Mel filterbanks are approximately logarithmically spaced, so frequency resolution becomes coarse above 8-10 kHz. HiFi-GAN therefore struggles to reconstruct air noise, sibilants (/s/, /ʃ/), and other fricatives (/f/) accurately at 44 kHz target rates. Systems targeting broadcast quality use linear spectrograms or auxiliary high-frequency predictors.
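The loss of resolution can be seen by computing where the mel bands fall. The sketch below uses the HTK-style mel formula, which is one common convention (libraries differ), and an assumed configuration of 80 bands from 0 Hz to 11,025 Hz.

```python
import math

def hz_to_mel(f):
    """HTK-style mel scale (an assumed convention; other variants exist)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_mels, f_min, f_max):
    """Centre frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]

centers = mel_center_frequencies(80, 0.0, 11025.0)  # assumed configuration
spacing = [b - a for a, b in zip(centers, centers[1:])]

print(f"spacing between the two lowest bands:  {spacing[0]:.1f} Hz")
print(f"spacing between the two highest bands: {spacing[-1]:.1f} Hz")
```

Under these assumptions neighbouring bands are tens of hertz apart at the bottom of the range and several hundred hertz apart at the top, so fine spectral detail in the fricative region is averaged away before the vocoder ever sees it.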

Speaker generalisation. A HiFi-GAN trained on one speaker often synthesises other speakers poorly. Multi-speaker training helps but requires diverse data and often a speaker embedding conditioning mechanism not present in the original architecture.

Very long silences and audio artefacts. Because the generator is purely feedforward and has no explicit prior about where silence should occur, it can produce ticking or clicking artefacts during long pauses where no energy is expected. These appear as isolated bursts of energy in frames that should be silent, plausibly because such segments are rare in training and the discriminators provide little coverage of them.

Normalised spectrogram assumptions. HiFi-GAN is sensitive to the exact mel-spectrogram normalisation (log scale, dynamic range clipping, number of mels) used at training time. Mismatched preprocessing between training and inference is a common silent failure mode that degrades quality without obvious error signals.
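One inexpensive safeguard is to compare the preprocessing settings explicitly before running inference. The sketch below is a hypothetical helper: the key names and values are illustrative and not tied to any particular codebase.

```python
# Illustrative settings; key names and values are assumptions for this example.
TRAIN_CONFIG = {
    "sampling_rate": 22050,
    "n_fft": 1024,
    "hop_size": 256,
    "win_size": 1024,
    "num_mels": 80,
    "fmin": 0,
    "fmax": 8000,
    "log_base": "e",
    "clip_min": 1e-5,
}

def find_mismatches(train_cfg, infer_cfg):
    """Return {key: (train_value, infer_value)} for every setting that differs."""
    keys = sorted(set(train_cfg) | set(infer_cfg))
    return {
        key: (train_cfg.get(key), infer_cfg.get(key))
        for key in keys
        if train_cfg.get(key) != infer_cfg.get(key)
    }

# An acoustic model that uses a different log base and upper frequency limit.
infer_config = dict(TRAIN_CONFIG, log_base="10", fmax=11025)
mismatches = find_mismatches(TRAIN_CONFIG, infer_config)

for key, (train_value, infer_value) in mismatches.items():
    print(f"{key}: trained with {train_value!r}, inferring with {infer_value!r}")
```

Failing loudly on any mismatch turns a silent quality regression into an error that is visible at startup.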