Quantised GEMM Kernels

A single forward pass of Llama-3 70B in FP16 moves roughly 140 GB of weight data through GPU memory each token. At 2 TB/s of HBM bandwidth, that is 70 ms/token before a single multiply-add has been computed. Switching to INT8 halves the transfer; INT4 quarters it. Quantised GEMM kernels are the mechanism that actually realises those savings at the hardware level, and getting them right is harder than changing a dtype flag.

What a GEMM kernel really does

A general matrix multiplication (GEMM) computes C = A × B, where the operands live in GPU global memory (HBM) and the result is written back. The kernel's job is to:

Tile the matrices into blocks that fit in on-chip SRAM (shared memory on CUDA devices).
Load a tile of A and a tile of B cooperatively across the thread block.
Accumulate partial dot products in registers.
Write the result tile back to global memory.

In FP16 GEMM, both storage and accumulation can happen at the same precision. In quantised GEMM the workflow splits:

Stage	Typical dtype
Weight storage (on disk / HBM)	INT4 or INT8
Dequantisation	FP16 or BF16
Matrix multiply (tensor core accumulator)	FP16/BF16 or INT32
Output / residual	FP16 or BF16

The key insight is that GPU tensor cores on Ampere/Hopper can execute INT8 × INT8 → INT32 natively at roughly 2x the throughput of FP16 (same number of tensor core cycles, but twice as many INT8 elements fit per cycle). INT4 support varies by generation: Ada and Hopper expose FP8 tensor cores, which is often more practical than INT4 for inference.

The dequantisation bottleneck

Naive quantisation stores every weight as INT8 and multiplies by a single global scale factor before the GEMM. This fails in practice for two reasons.

Outlier features. From roughly 6.7B parameters upward, transformer hidden states develop systematic high-magnitude channels (outlier features). If you quantise activations with a global scale chosen to cover the outlier, the other 99% of values are crushed into a handful of quantisation bins, destroying accuracy. LLM.int8() (Dettmers et al., 2022) handles this by decomposing each matrix multiplication: the outlier columns are isolated and kept in FP16, and the remaining 99.9%+ of values are multiplied in INT8. The two partial products are summed in FP16.

Per-channel vs. per-tensor scaling. A single scale per tensor wastes range. Per-channel (or per-group) scales let each output channel choose its own dynamic range. Most production kernels use group quantisation with a group size of 64 or 128 weights sharing one scale and zero-point:

# Conceptual layout: weight shape [K, N], group_size = 128
# scales shape: [K // 128, N]
# actual weight (decompressed):
w_fp16[k, n] = (w_int4[k, n] - zero_point[k // 128, n]) * scale[k // 128, n]

What a GEMM kernel really does

The dequantisation bottleneck

Keep reading with Pro.