Quantisation - INT8, INT4, FP8

A 70B model in bf16 needs 140 GB just to hold weights, plus another 20-40 GB for KV cache and activations. That does not fit on a single H100 (80 GB). Quantisation - storing weights or activations in fewer than 16 bits - is how you shrink the model to fit, raise the effective batch size, and (often) speed up inference since memory bandwidth is the bottleneck. The question is not whether to quantise but which scheme costs you the least quality.

PTQ vs QAT

Post-training quantisation (PTQ) takes a trained fp16 model and converts it offline using a small calibration set (typically 128-1024 samples). No gradients, no training infra. This is what almost everyone ships.
Quantisation-aware training (QAT) simulates quantisation during training so the model learns to be robust to it. Higher quality at the same bit-width, especially below 4 bits, but it needs a full training run. Used for on-device models (Gemma 3 QAT, Phi-3-mini-4bit) where 1-2 points of MMLU matters and you control training.

In practice: PTQ for serving big models you fine-tuned; QAT for tiny models you ship to phones.

The activation outlier problem

Naive INT8 quantisation works fine for weights but breaks on activations. At scale (~6B+), a small number of channels develop activation values 10-100x larger than the rest (Dettmers et al., 2022). Round those channels to 8-bit integers and you lose them; round everything else to fit those channels and you lose all the useful precision. The classic LLM.int8() paper handled this by keeping outlier channels in fp16 and quantising the rest. Modern approaches are subtler:

SmoothQuant migrates the activation difficulty into the weights by per-channel rescaling: Y = (X / s) @ (s * W). The weights absorb the scale (they can take it, having no outliers of their own); activations become uniform and quantisable.
AWQ (Activation-aware Weight Quantisation) observes that not all weights matter equally - the ~1% of weight channels aligned with large activation channels are critical. AWQ protects those with per-channel scaling before quantisation.
GPTQ uses second-order information (the inverse Hessian of the layer's reconstruction error) to quantise weight columns one at a time while updating the rest to compensate. State of the art for 3-4 bit weight-only quantisation.

FP8 on H100 and Blackwell

FP8 came with Hopper and is rapidly becoming the default for serving. Two formats:

E4M3 (4 exponent bits, 3 mantissa): tighter range, more precision. Used for weights and activations.
E5M2 (5 exponent, 2 mantissa): wider range, less precision. Used for gradients.

Tensor cores process FP8 at 2x FP16 throughput on H100 and 2x again on Blackwell. Quality is generally indistinguishable from bf16 if you scale per-tensor (or better, per-block) and keep norms and final logits in higher precision. Llama-3, DeepSeek-V3, and most production deployments now serve in FP8 by default.

The trade-off table

Numbers below are illustrative ranges from published benchmarks (GPTQ, AWQ, SmoothQuant papers; vLLM and TensorRT-LLM reports) on Llama-class 7B-70B models. Your mileage varies with model, calibration set, and task.

Scheme	Bits (W/A)	Memory vs fp16	Decode speedup	MMLU drop
fp16 / bf16	16 / 16	1.0x	1.0x	0
FP8 (E4M3)	8 / 8	0.50x	1.6-2.0x	<0.5 pt
INT8 SmoothQuant	8 / 8	0.50x	1.5-1.8x	~0.5 pt
INT8 weight-only	8 / 16	0.55x	1.2-1.4x	~0.2 pt
INT4 GPTQ	4 / 16	0.30x	1.5-2.0x	1-2 pt
INT4 AWQ	4 / 16	0.30x	1.5-2.0x	0.5-1.5 pt
INT4 weight + KV INT4	4 / 4	~0.25x	2.0-2.5x	2-4 pt

A few patterns hold across all of these:

Weight-only quantisation is the safest first step. Decoding is bandwidth-bound, so even though the matmul still runs in fp16 after dequantising on the fly, you read 4x less from HBM.
Activation quantisation (the second number) is where quality degrades fastest because of outliers. Use SmoothQuant or FP8 to make it survivable.
KV cache quantisation is the next frontier. INT4 KV cache (KIVI, AWQ-KV) cuts cache memory 4x with 1-3 points of quality loss; worth it for long-context serving.

When to reach for which

First port of call: FP8 on H100/B100. If your hardware supports it, you get half the memory, ~2x throughput, and no measurable quality loss for free.
Need to fit a 70B+ on one GPU: INT4 weight-only (AWQ or GPTQ). Quality survives, latency improves, and the kernels are mature in vLLM, TensorRT-LLM, and llama.cpp.
CPU or Apple Silicon inference: GGUF Q4_K_M or Q5_K_M. llama.cpp's k-quants are calibrated for CPU/Metal kernels and dominate that niche.
Sub-4-bit (2-3 bit): only with QAT or QuIP#. PTQ below 4 bits gets ugly fast. Not yet a serving default.

When it falls down

Long-context reasoning models (the kind that produce 10k+ tokens of chain-of-thought) suffer more from quantisation than chat models because errors compound across many decoded tokens. Quality drops measured on short-answer benchmarks can understate the real impact.
Fine-tuning a quantised model mostly does not work. QLoRA (LoRA on top of frozen INT4 weights) is the usable path; full fine-tunes need to dequantise.
Per-tensor scaling on tiny calibration sets gives wildly different results across runs. Always calibrate with at least 256 in-distribution samples and check on held-out eval, not just perplexity.