← Concept library

Applied LLMs

Quantised GEMM Kernels

Quantised GEMM kernels replace 16-bit or 32-bit matrix multiplications with 8-bit or 4-bit integer arithmetic, cutting memory bandwidth and compute cost while preserving model accuracy through careful scaling and outlier handling.

advanced · 9 min read · Premium

A single forward pass of Llama-3 70B in FP16 moves roughly 140 GB of weight data through GPU memory each token. At 2 TB/s of HBM bandwidth, that is 70 ms/token before a single multiply-add has been computed. Switching to INT8 halves the transfer; INT4 quarters it. Quantised GEMM kernels are the mechanism that actually realises those savings at the hardware level, and getting them right is harder than changing a dtype flag.

What a GEMM kernel really does

A general matrix multiplication (GEMM) computes C = A × B, where the operands live in GPU global memory (HBM) and the result is written back. The kernel's job is to:

Keep reading with Pro.

You're reading the preview. Unlock the full concept plus the library, study plans, the AI mentor, and daily emails.

Sign in to save and react.
Share Copied