Autotuning GPU Kernels

A single matrix-multiplication kernel can run at 40% or at 95% of peak FLOPS on the same GPU, and the only difference is the choice of three integers: the tile height, the tile width, and the K-dimension block size. Getting those wrong does not crash the program; it just silently hemorrhages throughput. That gap is the problem autotuning solves.

Why static heuristics break down

GPU performance is not a smooth function of problem size. It is lumpy because of discrete hardware resources: shared-memory banks, register file partitions, warp schedulers, L2 cache sets. A tile size of 128x128 with 4 pipeline stages may be optimal for a 4096x4096 matrix on an A100 but suboptimal on an H100 with more SRAM, and catastrophically slow on a workstation RTX 4090 with different register file pressure.

Library authors (cuBLAS, FlashAttention) invest person-years writing hand-tuned schedules for a fixed set of shapes and hardware generations. That approach does not transfer. Every new GPU SKU, every new operator shape (e.g., the 4096x2048 projections in a 7B LLM vs. the 1x4096 decode step), and every change to mixed-precision strategy potentially invalidates the heuristic.

Autotuning frames this as a search problem:

Search space: a finite set of candidate configurations, each a tuple (BLOCK_M, BLOCK_N, BLOCK_K, num_warps, num_stages, ...).
Objective: wall-clock latency (or throughput) measured on the actual target hardware.
Policy: exhaustive grid search (Triton's default), Bayesian optimisation, evolutionary search, or learned cost models.

The key insight is that measurement beats prediction: benchmarking each candidate takes microseconds to milliseconds, and running that sweep once at compile time is cheaper than a performance engineer's time.
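The three ingredients above (search space, objective, policy) can be sketched in plain Python. This is a minimal illustration of the exhaustive policy, not Triton's implementation: `grid`, `time_once`, and `autotune` are hypothetical names, and `measure` is an assumed callable that returns a latency for one configuration. On a real GPU the timed function would also have to synchronize the device before the clock is read, which this sketch leaves out.

```python
import itertools
import time
from typing import Callable, Dict, Iterable, List, Tuple

Config = Dict[str, int]


def grid(space: Dict[str, Iterable[int]]) -> List[Config]:
    """Search space: expand per-knob candidate values into every combination."""
    names = list(space)
    combos = itertools.product(*(space[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


def time_once(run: Callable[[Config], None], cfg: Config, reps: int = 5) -> float:
    """Objective: median wall-clock seconds over `reps` runs, after one warm-up."""
    run(cfg)  # warm-up call, excluded from the samples
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        run(cfg)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def autotune(configs: List[Config],
             measure: Callable[[Config], float]) -> Tuple[Config, float]:
    """Policy: measure every candidate and keep the fastest one."""
    best_cfg, best_latency = None, float("inf")
    for cfg in configs:
        latency = measure(cfg)
        if latency < best_latency:
            best_cfg, best_latency = cfg, latency
    return best_cfg, best_latency
```

A caller would build the candidates with `grid({"BLOCK_M": [64, 128], "BLOCK_K": [32, 64]})` and pass `lambda cfg: time_once(launch_kernel, cfg)` as `measure`, where `launch_kernel` stands for whatever function launches the kernel under test. Swapping the loop in `autotune` for a smarter policy leaves the other two pieces unchanged.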

The configuration space and its knobs

Understanding what you are searching over matters before understanding how to search.

Parameter	Effect	Typical range
`BLOCK_M`, `BLOCK_N`	Tile footprint in output matrix; controls reuse per load	32-256 (powers of 2)
`BLOCK_K`	Accumulation depth per inner-loop iteration; trades arithmetic intensity against shared-memory use	16-128
`num_warps`	Warps per thread block; affects occupancy and register pressure	2-16
`num_stages`	Software pipeline depth; hides global-memory latency	1-7
`num_ctas`	Thread blocks (CTAs) per thread-block cluster (Hopper+)	1-4

These interact non-linearly. A large BLOCK_M * BLOCK_N tile increases arithmetic intensity (good) but also increases the register footprint per thread, which caps the number of thread blocks that can live simultaneously on a Streaming Multiprocessor (SM). That cap is occupancy, and low occupancy means the warp scheduler has fewer warps to hide memory latency behind. The optimal point depends on the ratio of arithmetic-to-memory latency for the specific GPU's memory subsystem.
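Some of these interactions can be checked before any benchmark runs. The sketch below enumerates a grid over the tile and pipeline knobs and discards configurations whose staged input tiles would not fit in shared memory. The footprint formula is a common first-order estimate for a matmul (one A tile plus one B tile per pipeline stage) and is an assumption here, as are the helper names and the 48 KiB limit used in the usage note; register pressure, the other limit discussed above, is not modelled.

```python
from itertools import product
from typing import Dict, List


def smem_bytes(block_m: int, block_n: int, block_k: int,
               num_stages: int, elem_bytes: int = 2) -> int:
    """Estimated shared memory for the staged A (M x K) and B (K x N) tiles."""
    per_stage = (block_m * block_k + block_k * block_n) * elem_bytes
    return per_stage * num_stages


def candidate_configs(smem_limit_bytes: int) -> List[Dict[str, int]]:
    """Enumerate the grid and keep only configs that fit the shared-memory budget."""
    tiles = (32, 64, 128, 256)   # BLOCK_M and BLOCK_N candidates
    depths = (16, 32, 64, 128)   # BLOCK_K candidates
    stages = range(1, 8)         # num_stages candidates, 1 through 7
    kept = []
    for bm, bn, bk, st in product(tiles, tiles, depths, stages):
        if smem_bytes(bm, bn, bk, st) <= smem_limit_bytes:
            kept.append({"BLOCK_M": bm, "BLOCK_N": bn,
                         "BLOCK_K": bk, "num_stages": st})
    return kept
```

With an illustrative budget of `48 * 1024` bytes, a 128x128 tile with `BLOCK_K = 32` and 4 stages is rejected, because the estimate for it is 65,536 bytes. Pruning of this kind shrinks the search space but does not rank the survivors; that still takes measurement.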

A rough heuristic for tile size selection:

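# machine balance: FLOPs the GPU can execute per byte it can fetch from memory
machine_balance = peak_compute_throughput / peak_memory_bandwidth
target_reuse = arithmetic_intensity(op) / machine_balance
# if target_reuse > 1: compute bound, maximise tile; else: memory bound, smaller tiles + fuse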
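A roofline-style reading of this heuristic compares the kernel's arithmetic intensity (FLOPs per byte moved) with the machine balance (peak FLOP/s divided by peak bytes/s). The sketch below works that through for a single output tile. The function names are hypothetical, the intensity formula counts only the A and B tile loads per step of K, and the device figures in the usage note are round illustrative numbers rather than the specification of any real GPU.

```python
def tile_arithmetic_intensity(block_m: int, block_n: int,
                              elem_bytes: int = 2) -> float:
    """FLOPs per byte loaded for one block_m x block_n output tile.

    Per unit of K the tile performs 2 * block_m * block_n FLOPs
    (one multiply and one add per output element) and loads
    (block_m + block_n) elements of A and B.
    """
    flops = 2.0 * block_m * block_n
    bytes_loaded = (block_m + block_n) * elem_bytes
    return flops / bytes_loaded


def target_reuse(intensity: float, peak_flops: float,
                 peak_bytes_per_s: float) -> float:
    """Arithmetic intensity divided by machine balance (both in FLOP/byte)."""
    machine_balance = peak_flops / peak_bytes_per_s
    return intensity / machine_balance


def classify(reuse: float) -> str:
    """Above 1 the tile is limited by compute, otherwise by memory bandwidth."""
    return "compute bound" if reuse > 1.0 else "memory bound"
```

For a hypothetical device with 100 TFLOP/s of compute and 1 TB/s of bandwidth, the machine balance is 100 FLOP/byte. A 128x128 tile of 2-byte elements has an intensity of 64 FLOP/byte and classifies as memory bound, while a 256x256 tile reaches 128 FLOP/byte and classifies as compute bound. The estimate ignores cache hits and register limits, so it narrows the search rather than replacing it.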
