The GPU Execution Model

A modern H100 GPU ships 528 billion FP16 operations per second. A naively written kernel will use perhaps 10% of that. The gap is not a compiler problem or a driver problem. It is a structural consequence of how GPUs schedule work, and it is completely predictable once you understand the execution model.

The thread hierarchy: grids, blocks, and warps

When you launch a CUDA kernel you specify a grid of blocks, each block containing up to 1024 threads. This three-level hierarchy maps onto hardware in a specific way that has concrete performance consequences.

Each block is assigned to exactly one Streaming Multiprocessor (SM). An H100 has 132 SMs; a block can never split across two SMs. Within the SM, the hardware subdivides each block into warps of exactly 32 threads. The warp is the fundamental unit of scheduling and execution - not the block, not the thread.

Grid
  └─ Block (assigned to one SM)
       └─ Warp (32 threads, the scheduler's atom)
            └─ Thread (its own registers; shares SMEM with its warp-mates)

An SM can hold many warps simultaneously (up to 64 on Ampere). This occupancy is not about parallelism for its own sake; it is the primary mechanism for hiding latency. While one warp waits on a memory load (400-800 cycle round-trip to HBM), the warp scheduler selects another warp that is ready to execute. No context switch cost is paid: each warp's registers stay live on chip. The SM is an in-order pipeline that hides latency through warp switching, not out-of-order execution.

SIMT execution and divergence

GPU threads execute in the Single Instruction, Multiple Thread (SIMT) model. All 32 threads in a warp execute the same instruction simultaneously on different data - think of it as SIMD with a thread-level programming model layered on top.

The implication is stark for control flow:

// This branch causes warp divergence
if (threadIdx.x % 2 == 0) {
    do_path_A();   // odd threads masked off, wasted cycles
} else {
    do_path_B();   // even threads masked off, wasted cycles
}

When threads in the same warp take different branches, the warp executes both paths serially with masking. Threads not on the active path sit idle. Worst case: a 32-way branch reduces effective throughput to 1/32. Divergence is therefore a first-class concern when writing GPU kernels, not an edge case.

Threads in different warps can diverge for free - each warp is independently scheduled. The rule is: divergence within a warp is expensive; divergence across warps is free.

The memory hierarchy and occupancy arithmetic

Each SM has a small, fast shared memory (SMEM) that threads in the same block can use as a manually managed L1 cache. On Ampere and Hopper, the combined L1/SMEM pool is 256 KB per SM and can be partitioned. Registers are even faster - but register pressure limits occupancy.

The number of warps an SM can simultaneously host is limited by whichever resource is exhausted first:

max_warps = min(
    SM_max_warps,                          # hardware ceiling (e.g. 64 on A100)
    floor(SM_registers / (regs_per_thread * 32)),
    floor(SM_smem_bytes / smem_per_block) * block_warps
)

This creates a tension. A kernel that uses more registers per thread can do more work without spilling to local memory, but high register usage reduces occupancy, which reduces latency-hiding ability. The right tradeoff depends on whether your kernel is latency-bound (needs occupancy) or compute-bound (needs registers).

A quick occupancy sanity check: if your kernel uses 64 registers per thread and the A100 has 65,536 registers per SM, maximum resident threads = 65536 / 64 = 1024 = 32 warps. That is 50% of the 64-warp maximum. Whether 50% occupancy is fine or catastrophic depends on your kernel's memory access pattern.

Tensor cores and the warp-level matrix operation

Standard CUDA cores execute one multiply-accumulate per thread per clock. Tensor Cores are a separate execution unit introduced in Volta that operate at the warp level: all 32 threads in a warp collectively feed a small matrix multiplication unit.

The canonical instruction on Ampere is wmma::mma_sync, which computes a 16x16x16 matrix multiply-accumulate in a single operation across the warp. The throughput gain is substantial: an A100 delivers 312 TFLOPS in TF32 using Tensor Cores versus roughly 19.5 TFLOPS in FP32 using CUDA cores.

The catch is that Tensor Cores have strict input requirements. Matrices must be contiguous in memory and aligned to 128-byte boundaries. A matrix with a non-power-of-2 leading dimension, or data spread across non-contiguous allocations, will fall back to scalar code. cuBLAS handles this transparently when you use its API; hand-rolled GEMM kernels must be deliberate about layout.

A rough rule for identifying Tensor Core eligibility:

Operation	Tensor Core candidate?
Matrix multiply (any shape >= 16x16)	Yes, if aligned and contiguous
Batch matrix multiply	Yes, with strided batch API
Element-wise ops (ReLU, add)	No
Reductions (softmax, layernorm)	No (but can use after GEMM)

When it falls down

Low occupancy with no fallback. If a kernel uses enormous shared memory (e.g., a large lookup table per block), only one or two blocks fit per SM simultaneously. With so few warps, a single global memory load stalls the entire SM. The fix is either to reduce SMEM usage, split into smaller blocks, or accept a roofline-limited throughput.

False warp divergence in batch processing. A common mistake in NLP kernels: branching on sequence length per sample. If samples in the same batch have very different lengths and your thread block is shaped to handle one sample, divergence is minimal. But if you tile sequences across threads within a warp, short sequences cause heavy masking.

Bank conflicts in shared memory. SMEM is divided into 32 banks (one per thread in a warp). If multiple threads access the same bank simultaneously (but different addresses), accesses are serialised. A classic case: accessing a column of a row-major matrix stored in SMEM causes every access to hit the same bank for each row stride that is a multiple of 32. The fix is padding the row by one element.

Register spilling. When a kernel exceeds the register file size, the compiler spills registers to local memory, which is physically L2/DRAM. Spill stores and loads look identical to global memory accesses from a latency standpoint. The Nsight Compute profiler reports l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum for this. Even a small spill rate can halve throughput in a compute-bound kernel.

Overestimating occupancy benefit. A kernel bound by arithmetic throughput (close to the roofline compute ceiling) gains nothing from higher occupancy. Occupancy helps only when the bottleneck is latency from memory accesses. Profiling before tuning occupancy avoids chasing the wrong metric.

The thread hierarchy: grids, blocks, and warps

SIMT execution and divergence

The memory hierarchy and occupancy arithmetic

Tensor cores and the warp-level matrix operation

When it falls down

Further reading