Memory Coalescing

A NVIDIA A100 can deliver roughly 2 TB/s of memory bandwidth, yet a naive kernel that accesses global memory with a stride of 2 will achieve barely half that figure. The hardware is not broken; it is doing exactly what it was designed to do. Understanding why is the first step to writing kernels that are actually fast.

The warp and the cache line

The GPU executes threads in groups of 32 called warps. When a warp issues a load or store instruction, the hardware does not fire 32 independent 4-byte transactions. Instead, it collects the 32 addresses, sorts them, and tries to satisfy the entire warp in as few 128-byte (L2) or 32-byte (L1 sector) cache-line transactions as possible.

If the 32 threads in a warp access addresses that fall within the same aligned 128-byte region, one transaction suffices. If they scatter across 32 different cache lines, 32 transactions are needed and effective bandwidth drops by 32x. The hardware merges what it can, but it cannot re-arrange your data.

The rule is simple in principle: thread i in a warp should touch address base + i * element_size, where base is aligned to 128 bytes.

What coalescing looks like in practice

Consider a kernel that reads a row of a float matrix. Two indexing patterns, same data:

// Coalesced: thread i reads column i of row blockIdx.x
float val = A[row * N + threadIdx.x];

// Strided: thread i reads row i of column blockIdx.x
float val = A[threadIdx.x * N + col];

In the first pattern the 32 threads in a warp hit 32 consecutive floats (128 bytes), one transaction. In the second they hit 32 addresses separated by N * 4 bytes each. For a 4096-wide matrix that is a stride of 16 384 bytes between consecutive thread addresses - 32 distinct cache lines, 32 transactions.

The benchmark from NVIDIA's own best-practices guide (Tesla V100, HBM2) shows aligned sequential access reaching ~790 GB/s, while a stride-2 access pattern halves that, and stride-32 effectively reduces it to ~25 GB/s. The FLOP count is identical; only the memory pattern changed.

Tiling: coalescing when the algorithm forces non-sequential access

Matrix multiplication is the canonical case where naive access is non-coalesced on at least one operand. The standard fix is tiling through shared memory:

A tile of threads loads a rectangular block of A (coalesced, row-major) into shared memory.
The same tile loads a block of B (coalesced, row-major) into shared memory.
Threads compute the partial dot products entirely from shared memory, which has 32 banks and ~19 TB/s of aggregate bandwidth on A100.
Repeat across tiles.

The net effect is that each element of A and B is read from global memory exactly once per tile pass, and the reads are coalesced. The arithmetic-to-memory ratio (occupancy of compute units relative to memory transactions) improves roughly proportionally to the tile width.

Tiles loaded coalesced:    global → shared  (1 transaction per 32 threads per row)
Arithmetic on shared mem:  shared → registers (no global memory at all)

This is exactly what torch.nn.Linear and cuBLAS do internally. When you write a custom kernel in Triton, the tl.load with a 2-D block pointer takes care of tiling for you, but you are still responsible for laying the pointer arithmetic out so the load is contiguous in the innermost dimension.

Alignment and padding

Coalescing requires not only stride-1 access but also alignment. A 128-byte aligned base address means base % 128 == 0. CUDA's cudaMalloc guarantees 256-byte alignment. Problems arise when you:

Slice a tensor along an axis that produces a non-aligned row start (e.g. slicing A[1:] from a row-major matrix whose row is not a multiple of 128 bytes).
Use custom memory pools or shared-memory layouts with odd padding.
Cast a pointer from a wider type to a narrower one that shifts alignment.

PyTorch's Tensor.contiguous() and torch.compile will sometimes insert copies to restore alignment. When profiling, an unexpected memcpy in the trace is often a sign that alignment was lost.

Vectorised loads

Modern CUDA allows loading 4 or 8 bytes per thread per instruction (float4, int4, uint2). A warp of 32 threads issuing float4 loads moves 32 * 16 = 512 bytes per instruction. This keeps the memory bus wider and reduces instruction overhead. Triton generates vectorised loads automatically when the block size is a multiple of 16 elements. In hand-written CUDA you opt in explicitly:

// Load 4 floats at once; requires base + threadIdx.x*4 to be 16-byte aligned
float4 v = reinterpret_cast<float4*>(ptr)[threadIdx.x];

Profiling with ncu (NVIDIA's Nsight Compute) will report l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum versus l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum. A sectors/requests ratio of 1 means perfectly coalesced; a ratio of 32 means every thread caused its own transaction.

When it falls down

Structure-of-arrays vs. array-of-structures. If your data is laid out as struct { float x, y, z; } pts[N], reading only the x fields means stride-3 access. Convert to SoA (float xs[N], ys[N], zs[N]) or accept the penalty. Transformer attention heads can hit this when head-dimension is not the innermost axis.

Batch sizes of 1 at inference. With a single sequence, the warp may not have 32 useful threads, leaving most lanes inactive. The memory requests are issued but only a fraction carry useful data - effective bandwidth per useful byte plummets. Padding or speculative batching (as used in continuous batching schedulers like vLLM) is the operational fix.

Bank conflicts in shared memory. Coalescing governs global memory. Shared memory has a separate but analogous concern: 32 banks, accessed in parallel. If all 32 threads in a warp address the same bank (e.g. column-access of a row-major shared tile without padding), the accesses serialise. Padding shared arrays by one element per row is the classical fix.

Reductions with sequential addressing. A naive parallel sum where thread i adds elements 0, 2, 4, ... (doubling stride each iteration) is non-coalesced in early rounds. Reversing the addressing pattern to keep active threads contiguous fixes both coalescing and warp divergence simultaneously.

Non-power-of-two tensor dimensions. If a matrix row has 1000 floats (4000 bytes), rows start at offsets 0, 4000, 8000 ... none of which (except the first) is 128-byte aligned. Either pad to the next multiple of 32 floats, or accept that some transactions will be partial. cuBLAS pads internally; user-written kernels often do not.

Half-precision gotcha. float16 and bfloat16 are 2 bytes. A warp load of 32 half-precision values covers only 64 bytes, half a cache line. The effective coalescing efficiency per transaction is lower than for float32. Vectorised loads (float4 treating two fp16 values as one float) recover this.