NF4 and Double Quantisation
NF4 is a 4-bit data type matched to the approximately normal distribution of pretrained weights. Double quantisation further compresses the quantisation constants themselves. Together they allow a 65B-parameter model to be fine-tuned on a single 48 GB GPU via QLoRA.
A 65-billion-parameter model needs roughly 130 GB for its weights alone in BFloat16, at 2 bytes per parameter. A single H100 SXM holds 80 GB. Without quantisation, the weights do not even fit on one GPU, so fine-tuning there is out of the question. NF4, introduced in the QLoRA paper (Dettmers et al., 2023), closes that gap not by approximating the model crudely, but by exploiting a structural fact about how pretrained weights are distributed: they are usually close to a zero-centred normal (Gaussian) distribution.
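The arithmetic behind those figures can be checked with a short sketch. The helper name `weight_memory_gb` is hypothetical, gigabytes are taken as decimal (1 GB = 1e9 bytes), and the 4-bit figure counts the weights only: it ignores quantisation constants, adapters, optimiser state and activations.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 65e9  # the 65-billion-parameter model from the text

bf16_gb = weight_memory_gb(N_PARAMS, 16)     # BFloat16: 16 bits per weight
four_bit_gb = weight_memory_gb(N_PARAMS, 4)  # any 4-bit format, constants ignored

for name, gb in [("BFloat16", bf16_gb), ("4-bit", four_bit_gb)]:
    verdict = "fits" if gb < 80 else "does not fit"
    print(f"{name}: {gb:.1f} GB of weights, {verdict} in 80 GB")
```

At 4 bits per weight the same model drops to about a quarter of its BFloat16 size, which is what leaves headroom on a 48 GB card for everything else that fine-tuning needs.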
Why the weight distribution matters for quantisation
Standard 4-bit integer formats (INT4) map values to 16 equally spaced levels across a fixed range, and that range has to stretch far enough to cover the largest weight. If your weights are clustered near zero with light tails, many of those 16 levels sit out in the extremes where almost no weight values live, while the dense region near zero is shared among only a few. You are paying 4 bits per weight but getting the effective resolution of perhaps 2 or 3 bits.
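One way to put a number on "effective bits" is the entropy of how often each of the 16 levels is actually used. The sketch below makes several assumptions of its own: synthetic standard-normal weights stand in for a real tensor, the uniform grid is scaled by the absolute maximum of the whole tensor, and the comparison grid simply places levels at equal-probability quantiles. That quantile grid illustrates the idea of matching levels to the distribution; it is not the actual NF4 codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=50_000)  # synthetic stand-in for pretrained weights

# Uniform grid: 16 equally spaced levels spanning [-absmax, +absmax].
absmax = np.abs(weights).max()
uniform_levels = np.linspace(-absmax, absmax, 16)

# Quantile grid: one level at the centre of each equal-probability slice.
probs = (np.arange(16) + 0.5) / 16
quantile_levels = np.quantile(weights, probs)


def level_usage(x: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Count how many values round to each level (nearest-level assignment)."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return np.bincount(idx, minlength=len(levels))


def entropy_bits(counts: np.ndarray) -> float:
    """Shannon entropy of the level-usage histogram, in bits."""
    p = counts / counts.sum()
    p = p[p > 0]  # empty levels contribute nothing
    return float(-(p * np.log2(p)).sum())


uniform_counts = level_usage(weights, uniform_levels)
quantile_counts = level_usage(weights, quantile_levels)
uniform_bits = entropy_bits(uniform_counts)
quantile_bits = entropy_bits(quantile_counts)

print("uniform usage: ", uniform_counts)
print("quantile usage:", quantile_counts)
print(f"effective bits: uniform={uniform_bits:.2f}, quantile={quantile_bits:.2f}")
```

Under these assumptions the uniform grid's outer levels are almost empty and its entropy falls well short of 4 bits, while the quantile grid spreads the weights nearly evenly across all 16 levels.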