NF4 and Double Quantisation

A 65-billion-parameter model needs roughly 130 GB in BFloat16 for its weights alone. A single H100 SXM holds 80 GB. Without quantisation, the weights do not even fit on one GPU, before counting the gradients and optimiser state that fine-tuning adds. NF4, introduced in the QLoRA paper (Dettmers et al., 2023), closes that gap not by approximating the model crudely, but by exploiting a structural fact about how pretrained weights are distributed: they are almost always approximately Gaussian.

Why the weight distribution matters for quantisation

Standard 4-bit integer formats (INT4) map values to equally spaced bins across a fixed range. If your weights are clustered near zero with light tails, most of those 16 bins are wasted on the extremes where almost no weight values live. You are paying 4 bits per weight but using the resolution of perhaps 2 or 3 effective bits.
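To make the wasted-bins point concrete, the sketch below counts how many values land in each of 16 equal-width bins. It assumes synthetic standard-normal samples as a stand-in for real pretrained weights, and `uniform_bin_occupancy` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def uniform_bin_occupancy(weights: np.ndarray, n_levels: int = 16) -> np.ndarray:
    """Fraction of weights in each of n_levels equal-width bins over [-absmax, absmax]."""
    absmax = np.max(np.abs(weights))
    edges = np.linspace(-absmax, absmax, n_levels + 1)
    counts, _ = np.histogram(weights, bins=edges)
    return counts / weights.size

rng = np.random.default_rng(0)
weights = rng.standard_normal(100_000)  # assumption: stand-in for a weight tensor
occupancy = uniform_bin_occupancy(weights)
for i, frac in enumerate(occupancy):
    print(f"bin {i:2d}: {frac:.4f}")
```

On data like this, the two central bins should hold a large share of the values while the outermost bins stay nearly empty, which is the imbalance the paragraph above describes.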

A better strategy is to place quantisation levels where the data actually is, which is what quantile quantisation does. Given a distribution, you find the 17 quantile boundaries that split it into 16 bins of equal probability mass, and use one representative value per bin as the quantisation level. Every level is then used equally often, which is the information-theoretically optimal arrangement for a fixed bit-width.
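As a rough illustration of equal-mass bins, the following sketch estimates the bin boundaries empirically with `np.quantile` and checks how many values fall in each bin. The synthetic Gaussian data and the helper name `quantile_bin_edges` are assumptions made for the example.

```python
import numpy as np

def quantile_bin_edges(values: np.ndarray, n_levels: int = 16) -> np.ndarray:
    """Return n_levels + 1 boundaries that split `values` into equal-mass bins."""
    probs = np.linspace(0.0, 1.0, n_levels + 1)
    return np.quantile(values, probs)

rng = np.random.default_rng(0)
weights = rng.standard_normal(100_000)  # assumption: stand-in for a weight tensor
edges = quantile_bin_edges(weights)
counts, _ = np.histogram(weights, bins=edges)
fractions = counts / weights.size
print(np.round(fractions, 4))  # each entry should be close to 1/16
```

Note that the boundaries here are computed from the data itself, which is exactly the per-tensor estimation cost that a fixed, pre-computed codebook avoids.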

The problem with naive quantile quantisation is that you need to estimate the quantiles from data at runtime, which is slow and per-tensor. NF4 sidesteps this by pre-computing the optimal quantiles once, assuming the weights follow a zero-mean unit-variance normal distribution after normalisation:

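# Conceptual sketch (simplified; the actual NF4 table is built asymmetrically so that 0 is exactly representable)
import numpy as np
from statistics import NormalDist
Q = NormalDist().inv_cdf  # quantile function of N(0, 1)
# Q(0) and Q(1) are infinite, so stay slightly inside (0, 1); the 0.01 offset is illustrative
quantiles = np.array([Q(float(p)) for p in np.linspace(0.01, 0.99, 17)])  # 17 boundaries => 16 bins
nf4_codebook = (quantiles[:-1] + quantiles[1:]) / 2  # one representative value per bin
nf4_codebook = nf4_codebook / np.max(np.abs(nf4_codebook))  # scale to [-1, 1]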

At quantisation time, each block of weights (64 by default) is scaled so that its absolute maximum maps to 1.0, then each value is rounded to the nearest entry in the fixed 16-point NF4 codebook. The scale factor, one per block, is stored separately; in the QLoRA paper's configuration it is a 32-bit float. Dequantisation multiplies the looked-up codebook value by the stored scale. The total storage per weight is 4 bits for the index plus 32 bits / 64 = 0.5 bits for the scale, giving roughly 4.5 bits per parameter.

Double quantisation: quantising the quantisation constants

The scale factors themselves are full 32-bit floats. With a block size of 64, you need one scale per 64 weights, which is 32 bits / 64 = 0.5 bits per weight. That sounds cheap, but across a 65B model it accumulates to roughly 4 GB of scale-factor overhead.
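The block-wise scheme that produces those scale factors can be sketched as follows. This is a minimal illustration under stated assumptions: `make_codebook` builds a simplified normal-quantile codebook rather than the exact NF4 table, all function names are hypothetical, indices are kept in `uint8` rather than packed two per byte, and the input length must be a multiple of the block size.

```python
import numpy as np
from statistics import NormalDist

def make_codebook(n_levels: int = 16, offset: float = 0.01) -> np.ndarray:
    """Illustrative normal-quantile codebook; NOT the exact NF4 table."""
    probs = np.linspace(offset, 1.0 - offset, n_levels + 1)
    q = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    mids = (q[:-1] + q[1:]) / 2
    return mids / np.max(np.abs(mids))

def quantize_blockwise(weights: np.ndarray, codebook: np.ndarray, block_size: int = 64):
    """Return codebook indices and one absmax scale per block."""
    blocks = weights.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1)
    safe = np.where(scales == 0, 1.0, scales)  # avoid dividing by zero for all-zero blocks
    normalized = blocks / safe[:, None]        # each block now lies in [-1, 1]
    # Nearest codebook entry for every value.
    indices = np.argmin(np.abs(normalized[:, :, None] - codebook), axis=-1)
    return indices.astype(np.uint8), scales

def dequantize_blockwise(indices: np.ndarray, scales: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up each index and multiply by its block's scale."""
    return (codebook[indices] * scales[:, None]).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.standard_normal(64 * 8).astype(np.float32)
codebook = make_codebook()
indices, scales = quantize_blockwise(weights, codebook)
restored = dequantize_blockwise(indices, scales, codebook)
print("scales stored:", scales.size, "for", weights.size, "weights")
print("max abs error:", float(np.max(np.abs(restored - weights))))
```

The `scales` array is the per-block overhead discussed above: one entry for every 64 weights, regardless of how compactly the indices themselves are stored.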

Double quantisation applies a second round of quantisation to those scale constants:

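Collect the block-level scale factors into larger "super-blocks" (256 scales each in the QLoRA paper, so each super-block covers 256 × 64 weights).
Quantise those scales to 8 bits (the paper uses 8-bit floats), storing one float32 super-scale per super-block.
The 8-bit scales cost 8/64 = 0.125 bits per weight, and the super-scale is tiny: one float32 per 256 blocks of 64 weights is 32/(64 × 256) ≈ 0.002 bits per weight, for about 0.127 bits per weight in total.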
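The steps above can be sketched roughly as below. This is a simplification under stated assumptions: it groups 256 block scales per super-block, uses plain unsigned 8-bit linear quantisation rather than the 8-bit float format used in the QLoRA paper, and all function names are hypothetical. The number of scales must be a multiple of the group size.

```python
import numpy as np

def double_quantize_scales(scales: np.ndarray, group_size: int = 256):
    """Quantise block scales to uint8, with one float32 super-scale per group."""
    groups = scales.astype(np.float32).reshape(-1, group_size)
    super_scales = groups.max(axis=1, keepdims=True)
    super_scales = np.where(super_scales == 0, 1.0, super_scales).astype(np.float32)
    # Absmax scales are non-negative, so map [0, super_scale] onto 0..255.
    q = np.round(groups / super_scales * 255.0).astype(np.uint8)
    return q, super_scales.squeeze(1)

def restore_scales(q: np.ndarray, super_scales: np.ndarray) -> np.ndarray:
    """Approximately recover the original block scales."""
    return (q.astype(np.float32) / 255.0 * super_scales[:, None]).reshape(-1)

def scale_bits_per_weight(block_size: int = 64, group_size: int = 256,
                          first_bits: int = 8, second_bits: int = 32) -> float:
    """Scale-storage overhead per weight after double quantisation."""
    return first_bits / block_size + second_bits / (block_size * group_size)

rng = np.random.default_rng(0)
scales = rng.uniform(0.5, 4.0, size=1024).astype(np.float32)  # assumption: toy block scales
q, super_scales = double_quantize_scales(scales)
restored = restore_scales(q, super_scales)
print("max scale error:", float(np.max(np.abs(restored - scales))))
print("overhead bits per weight:", scale_bits_per_weight())
```

With the default arguments, `scale_bits_per_weight` evaluates 8/64 + 32/(64 × 256), which is the per-weight scale overhead after the second quantisation round.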
