The GPU Memory Hierarchy

A single H100 SXM GPU can push 3.35 TB/s of bandwidth to its on-package HBM3 memory. A single DDR5 CPU channel delivers roughly 50 GB/s. That 60x gap tells you why models run on GPUs, but it also hides a subtler story: inside the GPU itself, the memory hierarchy spans another three or four tiers, and the difference between a 10 TFLOP/s kernel and a 300 TFLOP/s kernel on the same hardware is almost always a question of which tier you are reading from.

The tiers, from fastest to slowest

Every CUDA-capable GPU arranges memory into roughly four levels. The numbers below are for the H100 SXM; older architectures scale proportionally.

Tier	Capacity per SM	Approx. bandwidth	Latency
Registers	~256 KB (64k × 32-bit regs)	~20 PB/s aggregate	~1 cycle
L1 / Shared memory	228 KB (H100), configurable split	~30 TB/s aggregate	~20-30 cycles
L2 cache	50 MB (H100), chip-wide	~12 TB/s	~200 cycles
HBM (global memory)	80 GB (H100 SXM)	3.35 TB/s	~600-700 cycles

Registers are private to a single thread: no sharing, no addressing by pointer. Shared memory (__shared__ in CUDA) is explicitly managed scratchpad shared among all threads in a block. L2 is a hardware-managed cache across all SMs. HBM (high-bandwidth memory) is the main DRAM sitting on the same package, connected via wide silicon interposers.

The key insight is that only shared memory is under programmer control. L1 and L2 have hardware policies; registers are allocated by the compiler. A kernel that touches HBM for every arithmetic operation is wasting the vast majority of what the chip can theoretically do.

Why the hierarchy exists at all

You cannot build a chip where everything is as fast as a register file. SRAM cells are large and power-hungry; fitting even 10 MB of register-equivalent storage on die at register speeds would consume a disproportionate fraction of the chip area and wattage budget. HBM trades latency for density: stacked DRAM dies bonded directly to the GPU give enormous capacity (up to 80-188 GB on H100 variants) at modest cost per bit.

The practical consequence is the roofline model. For any kernel, compute performance is bounded by either:

Peak FLOP/s throughput (compute-bound), or
Memory bandwidth (memory-bound).

The arithmetic intensity threshold at which you cross from memory-bound to compute-bound is:

AI_threshold = Peak FLOP/s / Peak Memory BW

For an H100 SXM doing FP16 Tensor Core operations (~989 TFLOP/s) against HBM (3.35 TB/s):

AI_threshold ≈ 989e12 / 3.35e12 ≈ 295 FLOPs per byte

A matrix multiply of large square matrices has arithmetic intensity well above 295; it is compute-bound and can approach Tensor Core peak. A simple elementwise activation function (ReLU, GELU) processes one float per memory access, giving arithmetic intensity of 0.25 FLOPs/byte - it is deeply memory-bound and its speed scales with HBM bandwidth, not FLOP/s.

Shared memory as a programmer-controlled L1

When a kernel loads a tile of data from HBM into shared memory, uses it many times for computation, then discards it, it is performing tiling: exploiting locality to amortise the expensive HBM access across many cheap shared-memory accesses.

A simplified tiled matrix multiply looks like this:

# Pseudocode for a tile of size T×T
for k_tile in range(K // T):
    # Load one tile from A and B into shared memory (one HBM round-trip per tile)
    smem_A[ty][tx] = A[row][k_tile * T + tx]
    smem_B[ty][tx] = B[k_tile * T + ty][col]
    __syncthreads()
    # Compute T MACs using shared memory (no HBM traffic)
    for k in range(T):
        acc += smem_A[ty][k] * smem_B[k][tx]
    __syncthreads()

Each element of smem_A and smem_B is loaded once from HBM and reused T times in the inner loop. For T = 128, that is 128x reduction in HBM traffic relative to a naive implementation. This is the mechanism behind NVIDIA's cuBLAS and the reason that carefully tuned GEMM kernels approach Tensor Core peak performance.

FlashAttention exploits exactly this principle for attention: rather than materialising the full N×N attention matrix in HBM (O(N²) memory), it tiles the Q, K, V matrices through shared memory, keeping intermediate results on-chip. The result is that HBM traffic scales with N rather than N², and the algorithm becomes IO-optimal for a wide range of SRAM sizes (as formally proven in Dao et al., 2022).

Bank conflicts and the hidden cost of shared memory

Shared memory is divided into 32 banks (on modern NVIDIA hardware), each 4 bytes wide. When multiple threads in a warp address different locations that map to the same bank, those accesses serialise. A warp that would otherwise complete a shared-memory read in one cycle might take 32 cycles if every thread hits a different address in the same bank.

The classic failure case is column-major access in a row-major tile:

// Naive: threads 0..31 access column 0 of a 32-column tile
// All map to bank 0 -> 32-way conflict
float val = smem[threadIdx.x][0];  // conflict

// Fix: transpose the tile on load, or pad columns by 1
float smem_padded[32][33];         // column 33 never used; shifts bank mapping

Padding by one element breaks the bank alignment and eliminates the conflict at the cost of 3% extra shared memory. Most production GEMM kernels do this implicitly.

When it falls down

Register spilling. Each SM has a fixed register file (65,536 32-bit registers on Hopper). If a kernel requires more registers per thread than available (given the chosen block size), the compiler spills registers to local memory, which is a region of global HBM. Latency jumps from 1 cycle to ~600 cycles. Profiling with ncu --set full and checking the "Warp Stalls: Long Scoreboard" metric will expose this. Reducing thread-level register pressure or launching smaller blocks are the two remedies.

Shared memory occupancy limits. More shared memory per block means fewer concurrent blocks per SM (occupancy drops). A kernel that allocates 228 KB of shared memory occupies an entire SM on H100, precluding any latency hiding from warp switching. The right tradeoff is workload-dependent: memory-bound kernels often benefit from high occupancy over large tiles; compute-bound kernels care less.

L2 thrashing. The 50 MB L2 is chip-wide. If many SMs simultaneously access distinct 50+ MB regions of HBM, the L2 provides no benefit. Persistent kernel patterns and streaming accesses with appropriate cache bypass hints (__ldg, ld.global.cs in PTX) can mitigate this, but the L2 is not a panacea for poor access patterns.

HBM bandwidth saturation across concurrently running kernels. A popular misconception is that a single small kernel can get 3.35 TB/s. In practice, HBM bandwidth is shared among all resident kernels, and a kernel with poor coalescing (scattered 32-byte transactions instead of 128-byte cache lines) can saturate the memory controller at a fraction of peak effective bandwidth. Coalesced access - where consecutive threads in a warp access consecutive addresses - is a prerequisite for approaching HBM peak.

Tensor Core latency hiding. Tensor Cores have high throughput but non-trivial latency. They rely on warp-level instruction-level parallelism and double-buffering (loading the next tile into shared memory while computing on the current tile). Kernels that fail to pipeline the load and compute stages pay the full HBM latency before each Tensor Core operation.

The tiers, from fastest to slowest

Why the hierarchy exists at all

Shared memory as a programmer-controlled L1

Bank conflicts and the hidden cost of shared memory

When it falls down

Further reading