Applied LLMs

Autotuning GPU Kernels

Autotuning systematically searches a discrete configuration space of tile sizes, warp counts, and pipeline stages to find the fastest kernel for a given GPU and problem shape, replacing manual heuristics with empirical benchmarking.

advanced · 9 min read · Premium

A single matrix-multiplication kernel can run at 40% or 95% of peak FLOPS on the same GPU, and the only difference may be the choice of three integers: the tile height, the tile width, and the K-dimension block size. A bad choice does not crash the program; it just silently hemorrhages throughput. That gap is the problem autotuning solves.
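
As a rough illustration of the idea, the search can be sketched as an exhaustive loop over the three integers that times each candidate and keeps the fastest. This is a minimal sketch, not any particular library's autotuner: `config_space`, `time_once`, `autotune`, and the `run` callable (which stands in for launching the kernel with a given configuration) are all hypothetical names introduced here.

```python
import itertools
import time


def config_space(tile_m_opts, tile_n_opts, tile_k_opts):
    """Enumerate every (tile_m, tile_n, tile_k) combination in the discrete space."""
    return [
        {"tile_m": m, "tile_n": n, "tile_k": k}
        for m, n, k in itertools.product(tile_m_opts, tile_n_opts, tile_k_opts)
    ]


def time_once(run, config):
    """Wall-clock one call of run(config).

    Assumption: run() blocks until the work is finished. A real GPU launch is
    asynchronous, so it would need a device synchronization before the clock
    is read.
    """
    start = time.perf_counter()
    run(config)
    return time.perf_counter() - start


def autotune(configs, run, warmup=1, repeats=3, timer=time_once):
    """Return (best_config, best_time) by benchmarking every candidate."""
    best_config, best_time = None, float("inf")
    for config in configs:
        # Warm-up calls absorb one-time costs such as compilation or cold caches.
        for _ in range(warmup):
            run(config)
        # Take the minimum over repeats to reduce the effect of timing noise.
        elapsed = min(timer(run, config) for _ in range(repeats))
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time
```

The `timer` argument is injectable so the loop can be exercised without a GPU. In practice the measured winner only holds for the GPU and problem shape it was benchmarked on, so it would typically be stored under a key that includes both.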

Why static heuristics break down

GPU performance is not a smooth function of problem size. It is lumpy because hardware resources come in discrete units: shared-memory banks, register file partitions, warp schedulers, L2 cache sets. A 128x128 tile with 4 pipeline stages may be optimal for a 4096x4096 matrix on an A100, suboptimal on an H100, which has more shared memory per SM, and catastrophically slow on a workstation RTX 4090, where the same kernel faces different register file pressure.
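
One way to see the lumpiness is to estimate how many thread blocks fit on one SM by shared memory alone. The sketch below assumes a simple footprint model (each pipeline stage buffers one tile of A and one tile of B in FP16) and uses approximate per-SM shared-memory capacities; both are assumptions made for illustration, and the figures should be checked against NVIDIA's documentation. Registers, reserved shared memory, and other occupancy limits are ignored. `smem_bytes`, `blocks_per_sm`, and `SMEM_PER_SM_KIB` are hypothetical names.

```python
# Approximate shared-memory capacity per SM in KiB (assumed values, not authoritative).
SMEM_PER_SM_KIB = {"A100": 164, "H100": 228, "RTX 4090": 100}


def smem_bytes(tile_m, tile_n, tile_k, stages, dtype_bytes=2):
    """Shared memory for one block: per stage, a tile_m x tile_k slice of A
    plus a tile_k x tile_n slice of B."""
    return stages * (tile_m * tile_k + tile_k * tile_n) * dtype_bytes


def blocks_per_sm(gpu, tile_m, tile_n, tile_k, stages, dtype_bytes=2):
    """Blocks that fit on one SM when shared memory is the only constraint."""
    budget = SMEM_PER_SM_KIB[gpu] * 1024
    return budget // smem_bytes(tile_m, tile_n, tile_k, stages, dtype_bytes)


if __name__ == "__main__":
    for tile_k in (32, 64):
        for gpu in SMEM_PER_SM_KIB:
            fit = blocks_per_sm(gpu, tile_m=128, tile_n=128, tile_k=tile_k, stages=4)
            print(f"{gpu}: tile_k={tile_k} -> {fit} block(s) per SM")
```

Under these assumptions, a 128x128 tile with `tile_k=32` and 4 stages needs 64 KiB, which fits twice on the A100 figure, three times on the H100 figure, and once on the RTX 4090 figure. Doubling `tile_k` to 64 drops the first two to one block and no longer fits on the third at all. The result moves in integer steps, not gradually, which is why a heuristic tuned on one GPU transfers poorly to another.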
