Shared Memory and Tiling

Global memory bandwidth on an H100 is roughly 3.35 TB/s. That sounds enormous until you realise a single matrix-multiply at FP16 on the same chip can demand ten to twenty times that bandwidth if every thread fetches its own operands independently. The arithmetic does not move fast; the data does not arrive fast enough. Shared memory and tiling exist to close that gap.

The memory hierarchy in brief

A CUDA GPU has several levels of memory, each with a different capacity/latency trade-off:

Level	Capacity (per SM)	Latency
Registers	~256 KB	1 cycle
Shared memory (SRAM)	32-228 KB (configurable)	~5-30 cycles
L1/L2 cache	automatic	~30-100 cycles
Global memory (HBM)	GBs	~600-800 cycles

Shared memory sits between registers and the L1 cache in the hierarchy. It is explicitly managed: the programmer loads data into it, synchronises threads, computes, and then writes results back to global memory. The GPU does not do this automatically. That explicit control is both the power and the responsibility.

On modern architectures (Ampere, Hopper) shared memory and L1 cache share the same physical SRAM, and you can partition the split at runtime. A kernel that wants more scratch space calls cudaFuncSetAttribute to tilt the balance toward shared memory, at the cost of a smaller automatic cache.

Tiling: the reuse argument

Consider a matrix multiplication C = A * B where each thread computes one element of C. Naively, thread (i, j) loads row i of A and column j of B entirely from global memory. Two threads in the same warp that happen to work on adjacent columns of C both need the same row of A. Without tiling they each fetch it separately. With tiling, a whole block of threads cooperates to load a small rectangular sub-matrix ("tile") of A and the corresponding tile of B into shared memory, and then every thread in the block reads from shared memory for the duration of that tile's dot product.

The saving is proportional to the tile size. A tile of BLOCK x BLOCK threads reading a BLOCK x BLOCK sub-matrix of A means each element of A is loaded from global memory once but read BLOCK times from shared memory. For BLOCK=16 that is a 16x reduction in global memory traffic for A (and symmetrically for B).

A sketch of the inner loop:

// Each thread block owns a BLOCK×BLOCK tile of C.
// Iterate over K in steps of BLOCK.
for (int k = 0; k < K; k += BLOCK) {
    // Cooperatively load tile of A and tile of B into __shared__.
    As[ty][tx] = A[row * K + (k + tx)];
    Bs[ty][tx] = B[(k + ty) * N + col];
    __syncthreads();          // barrier: all threads see the full tile

    // Accumulate the partial dot product from shared memory.
    for (int i = 0; i < BLOCK; i++)
        acc += As[ty][i] * Bs[i][tx];

    __syncthreads();          // barrier before next tile overwrites As/Bs
}

The two __syncthreads() calls are mandatory. The first ensures every thread has finished writing before any thread starts reading. The second ensures every thread has finished reading before the tile is overwritten. Omitting either is a data race that produces silently wrong results.

Bank conflicts and how to avoid them

Shared memory is divided into 32 equally sized banks (on compute capability 5.x and later), each 4 bytes wide, arranged in a round-robin pattern across addresses. When all 32 threads in a warp access 32 distinct banks simultaneously, the accesses are served in parallel in one cycle. When two or more threads in the same warp access the same bank (at different addresses within that bank), the accesses are serialised: a 2-way conflict halves throughput, a 32-way conflict reduces it to 1/32.

The classic bank-conflict trap is a column-wise read of a row-major shared memory array. If Bs is declared as float Bs[BLOCK][BLOCK] with BLOCK=32, then threads 0..31 reading Bs[0][0], Bs[1][0], ..., Bs[31][0] all land on bank 0. That is a 32-way conflict.

The standard fix is to add one element of padding to the column dimension:

__shared__ float Bs[BLOCK][BLOCK + 1];

This shifts successive rows by one element, scattering the column reads across different banks. The padding wastes 4 bytes per row but eliminates the conflict. NVIDIA's matrix transpose example reaches 99.5 GB/s on a K20c with this single change.

Tiling in practice: Triton and cuBLAS

Writing tiled CUDA kernels by hand is tedious and error-prone. Two abstractions raise the level:

Triton expresses tiles as first-class objects. You declare a block of pointers, load a tile with tl.load, compute on it, and store. The compiler handles the __syncthreads, the bank-conflict analysis, and the register pressure. Triton's matrix multiplication tutorial shows a kernel that reaches cuBLAS performance in roughly 40 lines of Python-syntax kernel code, without explicit shared memory declarations.

cuBLAS / cuDNN do all of this for you for standard operations. Their tiling strategies are tuned per-architecture, per-precision, and per-shape. For anything that fits their interface, use them. For novel operations (custom attention variants, fused activations, irregular shapes), you drop to Triton or CUDA.

When it falls down

Occupancy collapse from large tiles. Shared memory is a finite resource per streaming multiprocessor (SM). If your tile size is large, few thread blocks fit on the SM simultaneously. Low occupancy means the GPU cannot hide latency by switching between warps. The sweet spot between tile reuse and occupancy is workload-specific and often requires empirical tuning.

Small or irregular problem sizes. Tiling works best when K (the reduction dimension) is large and divisible by BLOCK. When it is not, you need boundary guards (if (k + tx < K)) that introduce predicated loads, reducing throughput and increasing code complexity. Very small matrices (e.g., batch size 1 at inference) may see no benefit from tiling and are sometimes better served by a different kernel shape entirely.

Non-power-of-two tile sizes and register pressure. Increasing BLOCK to 32 or 64 can reduce global memory traffic further, but each thread now holds more partial sums in registers. When register usage per thread rises, the number of resident warps per SM falls again, hurting latency hiding. Profiling with ncu --set full is the only reliable way to see where the bottleneck actually sits.

Dynamic shared memory allocation. If the kernel uses extern __shared__ float smem[]; with a runtime size, the size must be passed as the third launch parameter. Passing the wrong size (or forgetting to pass it at all, defaulting to 0) produces either incorrect results or a silent runtime error, depending on what the kernel does with the out-of-bounds accesses.

Volta/Turing async copies. From Ampere onward, cp.async can load data from global directly into shared memory without going through registers, allowing compute and data movement to overlap. Kernels written for older architectures that do not use this path leave throughput on the table on newer hardware.

The memory hierarchy in brief

Tiling: the reuse argument

Bank conflicts and how to avoid them

Tiling in practice: Triton and cuBLAS

When it falls down

Further reading