Applied LLMs
TPU Systolic Arrays
A systolic array is a grid of multiply-accumulate units wired to pass partial sums directly between neighbours, letting Google's TPU sustain 92 TOPS on matrix multiplication without repeatedly hitting off-chip memory.
intermediate · 7 min read
The first-generation Google TPU shipped a 256x256 grid of 8-bit multipliers. At 700 MHz, that is 92 TOPS of raw matrix throughput, yet the chip draws roughly a tenth of the power a GPU of that era would need for the same workload. The entire trick is a piece of 1980s computer-science that most ML practitioners have never had to think about: the systolic array.
The memory wall and why it forces a rethink
Modern CPUs and GPUs spend most of their time waiting for data. The "memory wall" is the growing gap between compute throughput (doubling roughly every two years) and memory bandwidth (improving far more slowly). A naive matrix-multiply kernel on a GPU fires thousands of threads, each fetching a value from DRAM, multiplying, writing back, and repeating. For large matrices the arithmetic intensity - floating-point operations per byte of memory traffic - is only moderate unless you tile carefully into shared memory.
The roofline model makes this precise. For a problem with arithmetic intensity I (ops/byte) and a machine with peak compute Pi (ops/s) and peak bandwidth Beta (bytes/s):
attainable_perf = min(Pi, Beta * I)
A 256x256 systolic array running bfloat16 multiply-accumulate has an effective arithmetic intensity orders of magnitude higher than a naive implementation because operands are never written back to memory between multiplications. They flow through the array, visiting each processing element (PE) exactly once.
How a systolic array actually works
Picture a grid of PE cells. Each cell holds one accumulator register. Data enters from two edges:
- Activations stream in from the left, one row per clock cycle, shifting rightward.
- Weights are pre-loaded into the cells (or stream in from the top, depending on the design).
At every clock tick, each PE:
1. Reads the activation arriving from its left neighbour.
2. Multiplies it by the locally stored weight.
3. Adds the product to its running accumulator.
4. Passes the activation unchanged to its right neighbour.
After N cycles (for an NxN array), every cell has accumulated a complete dot product. The result matrix drains out the bottom.
Here is a minimal pseudo-trace for a 2x2 systolic array computing C = A * B:
Weights pre-loaded:
PE[0,0]=B[0,0], PE[0,1]=B[0,1]
PE[1,0]=B[1,0], PE[1,1]=B[1,1]
Cycle 1: A[0,0] enters row 0
PE[0,0].acc += A[0,0]*B[0,0] → pass A[0,0] right to PE[0,1]
PE[0,1].acc += A[0,0]*B[0,1]
Cycle 2: A[0,1] enters row 0; A[1,0] enters row 1
PE[0,0].acc += A[0,1]*B[0,0]
PE[1,0].acc += A[1,0]*B[1,0] (row 1 also shifting rightward)
... and so on
After 4 cycles each PE[i,j].acc holds C[i,j] = sum_k A[i,k]*B[k,j]
The critical observation: each activation value is loaded from memory once and then propagates across the entire row by wire, not by re-fetching. The weights are stationary. No intermediate results ever leave the chip during the computation. Memory traffic is proportional to O(N^2) while arithmetic is O(N^3), so arithmetic intensity scales as O(N) - exactly what the roofline model wants.
How the TPU instantiates this
The TPU v1 Matrix Multiply Unit (MXU) is a 256x256 systolic array of 8-bit integer MACs, accumulating into 32-bit registers. The TPU v2 and later generations use 128x128 arrays of bfloat16 MACs (also accumulating in FP32). Google's official documentation states that each MXU performs 16,384 multiply-accumulate operations per cycle.
A TensorCore on a TPU chip wraps the MXU with:
| Component | Role |
|---|---|
| Matrix Multiply Unit (MXU) | Systolic array; bulk matmul |
| Vector Processing Unit (VPU) | Elementwise ops, activations, layer norm |
| Scalar unit | Control flow, address computation |
| High Bandwidth Memory (HBM) | Operand staging, weight buffers |
Data flows: host streams batches to HBM via PCIe/ICI; the VPU prepares tiles; the MXU consumes them in a steady systolic rhythm; partial results accumulate on-chip; the VPU applies non-linearities; output is written back to HBM.
The weight-stationary dataflow is the default for inference: weights are loaded once into the array, and many activation vectors are streamed through. For training, an output-stationary variant is sometimes used to keep gradient accumulation on-chip.
Determinism as a design goal
GPUs rely heavily on caches, out-of-order execution, and dynamic scheduling to hide latency. This makes throughput hard to predict but keeps general-purpose code fast. The TPU makes the opposite trade: the systolic array has no caches and no dynamic scheduling. Data arrives in lockstep, the wavefront propagates at a fixed rate, and latency is entirely predictable from the matrix dimensions. That determinism is not an accident. Datacenter inference must meet tail-latency SLOs; a design that occasionally stalls on a cache miss is harder to provision than one with fixed, provable execution time.
When it falls down
Small or irregular matrices. The 128x128 MXU takes 128 cycles to drain for a 128x128 matmul. If you feed it a 32x32 matrix, three-quarters of the PEs sit idle every cycle. Operations like per-token layer norm or small batch sizes can leave utilisation well below 20%. This is the primary reason that batching and sequence-packing matter so much on TPUs.
Sparse weight matrices. Systolic arrays assume dense arithmetic. A 50%-sparse weight matrix still takes the same number of cycles as a dense one; you get no throughput benefit from sparsity unless the compiler can densify or reshape the tiles first. GPUs with tensor core sparse support (Ampere's 2:4 structured sparsity) have an advantage here.
Non-matmul operations. Attention score computation is matmul-friendly; softmax, layer normalisation, and rotary positional embeddings are not. They execute on the VPU at lower TOPS than the MXU delivers. A model architecture that leans heavily on elementwise operations will see the VPU become the bottleneck.
Fixed numerical formats. The MXU accepts bfloat16 (or int8 on v1). Experimental training runs requiring FP32 accumulation throughout, or FP8 formats for aggressive quantisation, require separate compiler support and may not fully utilise the array.
Compilation rigidity. The XLA compiler must tile every matmul to match the MXU dimensions exactly. Odd model dimensions (not multiples of 128) require padding, which wastes both compute and memory. Framework-level ops that XLA cannot lower to matmul calls fall back to VPU or host execution, often with significant overhead.
Further reading
- In-Datacenter Performance Analysis of a Tensor Processing Unit (Jouppi et al., ISCA 2017) - the original TPU paper with full MXU specs and workload benchmarks.
- TPU System Architecture - Google Cloud documentation - official description of MXU dimensions, TensorCore layout, and HBM data flow for current TPU generations.
- Systolic Array - Wikipedia - clear diagrams of the original Kung and Leiserson design from 1978, which the TPU MXU directly descends from.