Applied LLMs
Quantised GEMM Kernels
Quantised GEMM kernels replace 16-bit or 32-bit matrix multiplications with 8-bit or 4-bit integer arithmetic, cutting memory bandwidth and compute cost while preserving model accuracy through careful scaling and outlier handling.
advanced · 9 min read · Premium
A single forward pass of Llama-3 70B in FP16 moves roughly 140 GB of weight data through GPU memory each token. At 2 TB/s of HBM bandwidth, that is 70 ms/token before a single multiply-add has been computed. Switching to INT8 halves the transfer; INT4 quarters it. Quantised GEMM kernels are the mechanism that actually realises those savings at the hardware level, and getting them right is harder than changing a dtype flag.
What a GEMM kernel really does
A general matrix multiplication (GEMM) computes C = A × B, where the operands live in GPU global memory (HBM) and the result is written back. The kernel's job is to:
Keep reading with Pro.
You're reading the preview. Unlock the full concept plus the library, study plans, the AI mentor, and daily emails.