The CUDA Programming Model

A single NVIDIA A100 has 6,912 CUDA cores. Running a naive matrix multiply that keeps only a handful of them busy is not a GPU problem - it is a programming model problem. Understanding why requires internalising how CUDA maps your code onto hardware, not just what the API calls look like.

The Execution Hierarchy

CUDA exposes a three-level structure:

Level	Hardware counterpart	Typical count
Grid	Whole GPU	1 per kernel launch
Block (CTA)	Streaming Multiprocessor (SM)	Hundreds to thousands
Thread	CUDA core	32 per warp (hardware scheduling unit)

You write a kernel - a C++ function annotated __global__ - and launch it with a grid of blocks, each block containing a fixed number of threads. The runtime maps each block to one SM; threads within a block share fast on-chip shared memory and can synchronise with __syncthreads(). Threads in different blocks cannot communicate during execution.

The fundamental scheduling unit is the warp: 32 threads that execute the same instruction in lockstep (SIMT). Branch divergence - when threads in the same warp take different paths - serialises those paths, each disabled thread sitting idle. Half your warp hitting an if branch you only wanted for edge cases costs you half your warp's throughput.

__global__ void add(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Launch: 1024 blocks of 256 threads each = 262,144 threads total
add<<<1024, 256>>>(a, b, c, n);

The if (i < n) guard only diverges in the last block if n is not a multiple of 256 - acceptable. A guard deeper in a loop, keyed on per-thread data, can be catastrophic.

Memory: The Hierarchy That Actually Determines Speed

Compute is rarely the bottleneck. Memory bandwidth is. CUDA exposes several memory spaces with very different costs:

Registers (~0 cycles, per-thread)
  L1 / Shared memory (~5 cycles, per-block)
    L2 (~30 cycles, per-SM cluster)
      Global / HBM (~200-400 cycles, whole GPU)
        Host (PCIe, ~microseconds per transfer)

Global memory coalescing is the single most impactful rule: if the 32 threads of a warp each access a consecutive 4-byte word, the hardware fuses all 32 accesses into one or two 128-byte transactions. If each thread jumps to a random address, you get 32 separate transactions. A 16x bandwidth penalty for the same number of arithmetic operations.

Shared memory tiling is the canonical response. Load a tile of data from global memory cooperatively (all threads in a block load one element each, coalesced), perform many arithmetic operations on it in shared memory, then write results back. Classic matrix multiply with shared memory tiling:

// Each block computes a TILE_SIZE x TILE_SIZE output sub-matrix.
// Load A's tile and B's tile into shared memory, then accumulate.
__shared__ float As[TILE][TILE];
__shared__ float Bs[TILE][TILE];

for (int t = 0; t < K/TILE; t++) {
    As[ty][tx] = A[row * K + (t*TILE + tx)];
    Bs[ty][tx] = B[(t*TILE + ty) * N + col];
    __syncthreads();
    for (int k = 0; k < TILE; k++) sum += As[ty][k] * Bs[k][tx];
    __syncthreads();
}

The ratio of arithmetic operations to global memory accesses is called arithmetic intensity. Tiling raises it from near-zero to (TILE/2) FLOPs per byte - and raising that ratio is what separates memory-bound from compute-bound kernels.

Occupancy and Latency Hiding

GPUs hide memory latency through occupancy: keeping many warps resident on each SM so that when one warp stalls waiting for memory, the SM instantly switches to another. No OS context switch, no register save - the SM holds all resident warps' registers simultaneously.

Occupancy is the fraction of maximum resident warps that your launch configuration achieves. It is limited by three resources:

Registers per thread. The SM has a fixed register file (~65k 32-bit registers on Ampere). More registers per thread means fewer warps can share that file.
Shared memory per block. The SM has ~100 KB on-chip. Large shared memory tiles block more warps from fitting.
Block size. Too-small blocks (e.g., 32 threads) leave blocks under-subscribed; too-large blocks limit how many blocks fit per SM.

High occupancy is not always necessary - if your kernel has very high arithmetic intensity and few memory stalls, lower occupancy can be fine. But for memory-bound kernels, occupancy is the lever.

Kernel Fusion: Cutting Launch Overhead

Each kernel launch has overhead: the host sends a command to the GPU, the driver queues it, and the hardware schedules it. For small operations (layer normalisation, activation functions, element-wise adds), this overhead can dominate.

Kernel fusion writes a single kernel that performs multiple logical operations. A standard example: fusing bias-add + layer norm + GELU activation into one kernel pass. Instead of writing to and reading from global memory three times, you do all three operations on data held in registers or shared memory.

Modern frameworks automate this. PyTorch 2.0's torch.compile with the Inductor backend analyses your model's computation graph, identifies fusable operation sequences, and emits fused Triton kernels. Testing across 163 open-source models, PyTorch reported a 43% average training speedup on an A100 from compilation alone - most of that from fusion eliminating memory round-trips.

CUDA Graphs: Eliminating CPU Overhead at Inference

For fixed-topology workloads (typical inference), even the per-iteration CPU-to-GPU command submission adds latency. CUDA Graphs capture an entire sequence of kernel launches - including their dependencies and memory transfers - as a single graph object. Subsequent iterations replay the captured graph with one API call, removing the CPU from the critical path.

The catch: the graph is static. Tensor shapes and data pointers must match the captured run. Dynamic inputs (variable sequence lengths, conditional branches based on input values) require re-capturing or separate graphs. This makes CUDA Graphs primarily useful for fixed-shape inference serving, where they can reduce iteration latency by 10-30% depending on kernel count.

When It Falls Down

Irregular workloads. Sparse operations, graph neural networks, dynamic sequence lengths - these create uneven work distributions. Some warps finish early and sit idle; some blocks are much heavier than others. The rigid grid-block-thread hierarchy offers no built-in work stealing. Cooperative Groups (a CUDA extension) helps, but sparse workloads still often underperform their dense counterparts by a factor of several times.

Shared memory bank conflicts. Shared memory is divided into 32 banks (on modern CUDA). If multiple threads in a warp access different addresses that map to the same bank, accesses serialise. A classic trap: a 32-column 2D shared memory array where threads access column-major - every thread hits the same bank. Padding the array by one column often cures it.

Register spilling. If a kernel uses more registers than the hardware allows per thread, the compiler spills to local memory - which resides in global memory. A kernel that looked compute-bound becomes memory-bound. Inspecting compiled code with nvcc --ptxas-options=-v shows register usage; reducing kernel complexity or splitting the kernel is the fix.

Occupancy-performance mismatch. The CUDA occupancy calculator is a guide, not a rule. Kernels with very high arithmetic intensity (like tensor-core GEMM) run well at low occupancy because the cores are always busy. Optimising occupancy at the cost of shared memory tile size can reduce performance for compute-bound kernels.

PCIe transfers hiding. All the GPU-side optimisation is irrelevant if you transfer data between host and device naively. PyTorch operations that trigger a .item() call on a tensor (or any implicit device synchronisation) stall the entire GPU pipeline while the CPU waits.