Occupancy and Latency Hiding

A modern H100 SXM5 GPU has 132 streaming multiprocessors (SMs), each capable of holding up to 64 resident warps simultaneously. That is 8,448 warps - roughly 270,000 threads - resident at once on a single chip. When a warp stalls on a memory load, the SM does not wait for the load to return; it issues instructions from a different warp that is ready. This mechanism, latency hiding through warp switching, is the reason GPUs sustain far more useful work per second than their raw DRAM latency would suggest.

What occupancy actually measures

Occupancy is a ratio:

occupancy = active warps per SM / maximum warps per SM

On an A100, the maximum is 64 warps per SM (2,048 threads). If your kernel launches blocks of 128 threads (4 warps each) and only 8 blocks fit on an SM due to register pressure, you have 32 active warps - 50% occupancy.
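The ratio is simple enough to check by hand, but a minimal sketch makes the units explicit. The function name and parameters below are illustrative, not part of any CUDA API; the only hardware facts assumed are 32 threads per warp and the 64-warp limit quoted above.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def occupancy(threads_per_block, resident_blocks, max_warps_per_sm=64):
    """Return (active warps per SM, occupancy as a fraction of the maximum)."""
    # A partially filled warp still occupies a whole warp slot, so round up.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    active_warps = warps_per_block * resident_blocks
    return active_warps, active_warps / max_warps_per_sm


# The example from the text: 128-thread blocks, 8 of them resident per SM.
active, ratio = occupancy(threads_per_block=128, resident_blocks=8)
print(active, ratio)  # 32 0.5
```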

The number is not intrinsically good or bad. It is a proxy for latency-hiding capacity. Each cycle, an SM's warp schedulers can issue an instruction only from a warp that is ready (no pending memory transaction, no unresolved data dependency). In a memory-bound kernel, most warps at any given cycle are stalled waiting for L2 or DRAM. If only a few warps are resident, the scheduler has little to switch to, and the SM sits with its execution units idle. If many warps are resident, chances are high that at least one is ready to issue - the scheduler picks it and the stall disappears from the critical path.

The global memory latency on an A100 is roughly 400-600 cycles. At 50% occupancy with 32 resident warps, the SM can in principle have 32 independent memory requests outstanding (one per warp) before the first one returns. Whether that is enough depends entirely on the ratio of arithmetic to memory operations in your kernel - the arithmetic intensity.
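How many warps are "enough" can be estimated with a deliberately idealised model. It is a sketch, not a description of the real scheduler: it assumes every warp alternates between a fixed number of cycles of independent instructions and one load that takes a fixed latency to return, and it looks at a single warp scheduler (an A100 SM has four, so 32 resident warps means 8 per scheduler). The function name is hypothetical.

```python
import math


def warps_needed(latency_cycles, issue_cycles_per_warp):
    """Warps one scheduler needs so that some warp is always ready to issue.

    While one warp waits `latency_cycles` for its load, the other warps must
    supply that many cycles of work between them.
    """
    return 1 + math.ceil(latency_cycles / issue_cycles_per_warp)


# 500-cycle latency, with varying amounts of arithmetic between loads.
for work in (10, 50, 200):
    print(work, warps_needed(500, work))
# 10 51
# 50 11
# 200 4
```

Under these assumptions a kernel that does only 10 cycles of work per load cannot hide the latency even at full occupancy, while one that does 200 cycles of work per load needs only a handful of warps per scheduler.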

The three resource limits that cap occupancy

Three SM-level resources constrain how many blocks can be scheduled concurrently, and therefore set a ceiling on occupancy:

Resource	Typical A100 limit per SM	How it kills occupancy
Registers	65,536 per SM	Each thread uses N registers, so at most 65,536 / N threads can be resident; a block that does not fit whole cannot co-reside
Shared memory	164 KB configurable per SM	Blocks declaring large shared buffers leave no room for others
Thread slots	2,048 per SM	Block count x block size cannot exceed this regardless of register/smem

Register pressure is the most common culprit for low occupancy. A kernel with 128 registers per thread can fit at most 512 threads (16 warps) on an SM regardless of shared memory or block size. The NVIDIA occupancy calculator and the cudaOccupancyMaxActiveBlocksPerMultiprocessor API expose exactly this calculation so you can explore the trade-off without profiling blind.
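The calculation behind those limits can be sketched in a few lines. This is a simplified model using the A100 figures from the table as defaults; it ignores register and shared-memory allocation granularity and the separate cap on blocks per SM, so the real occupancy calculator can report lower numbers. The function and its parameter names are illustrative only.

```python
WARP_SIZE = 32


def resident_blocks(threads_per_block, regs_per_thread, smem_bytes_per_block,
                    regs_per_sm=65_536, smem_bytes_per_sm=164 * 1024,
                    threads_per_sm=2_048):
    """Return (blocks that fit on one SM, name of the limiting resource)."""
    limits = {
        "registers": regs_per_sm // (regs_per_thread * threads_per_block),
        "thread slots": threads_per_sm // threads_per_block,
    }
    if smem_bytes_per_block > 0:
        limits["shared memory"] = smem_bytes_per_sm // smem_bytes_per_block
    limiter = min(limits, key=limits.get)  # the tightest limit wins
    return limits[limiter], limiter


# The example from the text: 128 registers per thread, 128-thread blocks.
blocks, limiter = resident_blocks(128, 128, 0)
warps = blocks * 128 // WARP_SIZE
print(blocks, limiter, warps, warps / 64)  # 4 registers 16 0.25
```

Changing one argument at a time shows which resource takes over as the limiter; for instance, 256-thread blocks at 32 registers per thread with 48 KB of shared memory each are limited to 3 blocks by shared memory.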

A useful rule of thumb: compile with -maxrregcount=N to cap register usage and raise occupancy, but watch for the compiler spilling registers to local memory, which lives in device memory (cached in L1/L2) and adds load/store instructions and latency of its own. The trade-off is real.

Why latency hiding requires independent work, not just thread count

Warp switching covers latency only if the warp the scheduler switches to does not itself need the stalled warp's result. Consider a kernel where each warp depends on a value loaded by a neighbouring warp: the scheduler switches, the new warp issues its own load, then reaches an instruction that depends on the previous warp's still-missing result and stalls as well. You now have two stalled warps instead of one.

The genuine requirement for latency hiding is independent memory transactions - either across warps (inter-warp independence, the normal case) or within a single warp via instruction-level parallelism. When a GEMM kernel tiles a large matmul across shared memory, each warp in the block accesses a different portion of the tile; none depends on another warp's load result, so with 32 resident warps at least 32 independent memory requests can be in flight per SM simultaneously.
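Little's law gives a rough way to size the independent work required: the bytes that must be in flight equal bandwidth times latency. The sketch below plugs in nominal figures for an A100 40 GB (1,555 GB/s, 1.41 GHz, 108 SMs) and a 500-cycle latency; those numbers, the 128-byte coalesced warp load, and both function names are assumptions made for illustration.

```python
def bytes_in_flight_needed(bandwidth_bytes_per_s, latency_cycles, clock_hz):
    """Little's law: data that must be outstanding to sustain full bandwidth."""
    latency_s = latency_cycles / clock_hz
    return bandwidth_bytes_per_s * latency_s


def bytes_in_flight_available(resident_warps, loads_per_warp, bytes_per_load=128):
    """Outstanding data per SM if each warp keeps `loads_per_warp` loads pending.

    128 bytes corresponds to a fully coalesced warp-wide load of 4-byte values.
    """
    return resident_warps * loads_per_warp * bytes_per_load


# Assumed nominal A100 40 GB figures; substitute your own device's numbers.
needed_per_sm = bytes_in_flight_needed(1555e9, 500, 1.41e9) / 108
print(round(needed_per_sm))              # about 5 KB per SM
print(bytes_in_flight_available(32, 1))  # 4096: one load per warp falls short
print(bytes_in_flight_available(32, 2))  # 8192: two independent loads per warp suffice
```

With these inputs, 32 warps each holding a single outstanding load do not quite cover the requirement, while the same 32 warps with two independent loads each do. That is the instruction-level-parallelism route to latency hiding mentioned above.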

This is also why attention kernels with large head_dim often run at lower occupancy but still achieve high throughput: the arithmetic intensity is high enough that most cycles are spent in the tensor cores, not waiting on DRAM, and fewer warps are needed to keep execution units busy.
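The roofline model is one way to make "high enough arithmetic intensity" concrete. The sketch below assumes nominal A100 figures (312 TFLOP/s dense FP16 tensor-core peak, 1,555 GB/s memory bandwidth); these numbers and the function names are illustrative assumptions, and real kernels fall below the roofline.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity (FLOP per byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


def attainable_flops(arithmetic_intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline bound: limited by memory traffic or by peak compute, whichever is lower."""
    return min(peak_flops, arithmetic_intensity * bandwidth_bytes_per_s)


PEAK = 312e12       # assumed: A100 dense FP16 tensor-core peak, FLOP/s
BANDWIDTH = 1555e9  # assumed: A100 40 GB memory bandwidth, bytes/s

print(f"ridge point: {ridge_point(PEAK, BANDWIDTH):.0f} FLOP/byte")
for intensity in (10, 100, 1000):
    bound = attainable_flops(intensity, PEAK, BANDWIDTH)
    kind = "compute-bound" if bound == PEAK else "memory-bound"
    print(f"{intensity:>5} FLOP/byte -> {bound / 1e12:.1f} TFLOP/s ({kind})")
```

A kernel to the right of the ridge point gains nothing from hiding more memory latency, which is why it can tolerate fewer resident warps.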

Shared memory as a second dial

Shared memory and occupancy interact in a non-obvious way. Declaring more shared memory per block reduces the number of co-resident blocks, directly lowering occupancy. Yet using shared memory to stage data - loading a tile once from global memory and reusing it many times - reduces total global memory traffic, which lessens how much latency needs to be hidden in the first place.
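The two sides of that trade can be counted for a square matrix multiply. The sketch below is a textbook model, not a measurement: it assumes an N x N by N x N product, square tiles, one block per output tile, and no cache hits, and the function names are hypothetical.

```python
def global_loads_naive(n):
    """Element loads when every output element reads its full row and column."""
    return 2 * n ** 3


def global_loads_tiled(n, tile):
    """Element loads when each block stages tile x tile pieces of A and B."""
    if n % tile:
        raise ValueError("n must be a multiple of tile in this simplified model")
    blocks = (n // tile) ** 2           # one block per output tile
    steps = n // tile                   # tile pairs each block walks through
    return blocks * steps * 2 * tile * tile


def smem_bytes_per_block(tile, bytes_per_element=4):
    """Shared memory needed to hold one tile of A and one tile of B."""
    return 2 * tile * tile * bytes_per_element


n = 4096
for tile in (16, 32, 64):
    saving = global_loads_naive(n) // global_loads_tiled(n, tile)
    print(tile, saving, smem_bytes_per_block(tile))
# 16 16 2048
# 32 32 8192
# 64 64 32768
```

Doubling the tile edge halves global traffic but quadruples the shared memory each block claims, which is exactly the tension between reuse and co-resident blocks described above.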

The Ampere and Hopper architectures let you choose the split between L1 cache and shared memory at runtime by setting the cudaFuncAttributePreferredSharedMemoryCarveout attribute via cudaFuncSetAttribute. The setting is a hint that the driver may override. A kernel doing heavy tiling benefits from cudaSharedmemCarveoutMaxShared; a kernel doing irregular access benefits from cudaSharedmemCarveoutMaxL1. Choosing the wrong split can cost 20-30% throughput without any change to the kernel logic itself.

A concrete example: FlashAttention-2 reaches roughly 50-70% of theoretical peak FLOP/s on an A100 not by maximising occupancy, but because the tiling strategy ensures the attention computation is compute-bound rather than memory-bound. Fewer resident blocks are needed because the blocks that do run issue sustained tensor-core work, not a burst of loads followed by idle stalls.

When it falls down

Compute-bound kernels do not benefit from higher occupancy. If your SM's arithmetic units are saturated, adding more resident warps just queues more work without raising throughput. Profiling with ncu will show compute throughput (for example sm__throughput.avg.pct_of_peak_sustained_elapsed) near 100%, whatever the achieved-occupancy metric sm__warps_active.avg.pct_of_peak_sustained_active reports; forcing more occupancy here provides no improvement and may introduce register spills that hurt performance.

Small, short-lived kernels waste the ramp-up. A kernel that runs for fewer than a few hundred cycles never achieves steady-state warp residency. The launch overhead and the pipeline fill dwarf any latency-hiding benefit. CUDA graphs (exposed in PyTorch through torch.cuda.CUDAGraph) mitigate this by amortising launch overhead across many kernels, but the underlying occupancy problem remains for kernels whose grid size is smaller than the SM count.
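The effect of a small grid can be estimated by counting "waves" of blocks. This is a rough sketch that assumes every block takes the same time and that the hardware fills all available block slots before starting the next wave; the function name and the 108-SM default (an A100) are illustrative.

```python
import math


def sm_utilisation(grid_blocks, sm_count=108, blocks_per_sm=1):
    """Return (number of waves, fraction of block slots doing work across them)."""
    slots = sm_count * blocks_per_sm
    waves = math.ceil(grid_blocks / slots)
    return waves, grid_blocks / (waves * slots)


for grid in (54, 108, 109, 216):
    waves, used = sm_utilisation(grid)
    print(grid, waves, round(used, 3))
# 54 1 0.5
# 108 1 1.0
# 109 2 0.505
# 216 2 1.0
```

A grid of 54 blocks leaves half the SMs idle no matter how well each block hides latency, and a grid one block larger than the SM count pays for a nearly empty second wave.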

Register spilling can reverse the occupancy gain. Forcing -maxrregcount=32 on a kernel that naturally needs 80 registers will raise occupancy but generate local memory loads and stores (local memory is per-thread storage that resides in device memory and is cached in L1/L2). Now the SM has more warps in flight, each generating more memory traffic, potentially saturating DRAM bandwidth and making things worse.

Uncoalesced access patterns undermine warp switching. If the threads in a warp access scattered addresses, the hardware issues multiple memory transactions per warp instruction (up to 32, one per thread, in the worst case). High occupancy means many such expensive transactions in flight simultaneously, which can saturate the memory subsystem faster than a kernel with lower occupancy and coalesced access.
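The cost of a strided pattern can be counted directly. The sketch below assumes 32-byte memory sectors, a sector-aligned base address, and 4-byte elements, and counts how many distinct sectors one warp-wide load touches; the function name is hypothetical.

```python
def sectors_per_warp_load(stride_elements, element_bytes=4,
                          sector_bytes=32, warp_size=32):
    """Distinct memory sectors touched when lane i reads element i * stride."""
    addresses = (lane * stride_elements * element_bytes
                 for lane in range(warp_size))
    return len({address // sector_bytes for address in addresses})


for stride in (1, 2, 8, 32):
    print(stride, sectors_per_warp_load(stride))
# 1 4
# 2 8
# 8 32
# 32 32
```

A unit stride moves 128 useful bytes in 4 sectors; a stride of 8 elements or more moves the same 128 useful bytes in 32 sectors, an eightfold increase in traffic that no amount of warp switching recovers.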

Persistent kernels and cooperative groups change the model. With persistent kernel designs (common in fused attention implementations), the grid is sized to fill every SM once and the resident blocks stay there until all the work is done. Occupancy still determines how many blocks can be resident, but it no longer describes how work reaches the SMs: the scheduling granularity shifts to work-items dispatched inside the kernel, and the hardware block scheduler is replaced by software-managed work queues. Warp switching within each SM still hides latency as before.
