XLA and Just-In-Time Compilation

A naive PyTorch training step launches hundreds of separate CUDA kernels per iteration. Each launch carries overhead, and between kernels the GPU stalls while intermediate tensors are written to and read back from HBM. XLA's central bet is that if you see the whole computation graph before executing any of it, you can collapse those hundreds of round-trips into a handful of fused operations that never touch slow memory between steps. That bet turned out to be correct enough to power Google's entire TPU stack.

What "compiling a graph" actually means

Every ML framework, at some level, represents a forward pass as a directed acyclic graph of operations (matmul, softmax, layer norm, ...). An eager runtime executes each node immediately as Python reaches it. A JIT compiler instead traces or captures that graph, hands it to an optimiser, and emits hardware-specific code for the whole thing.
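The difference between eager execution and graph capture can be seen directly in JAX. The sketch below is only illustrative: the `layer` function and its shapes are made up for this example, and `jax.make_jaxpr` is used simply to print the captured graph.

```python
import jax
import jax.numpy as jnp


def layer(w, h):
    # Hypothetical toy layer: a matmul followed by an element-wise nonlinearity.
    return jnp.tanh(w @ h)


w = jnp.ones((4, 3))
h = jnp.ones((3,))

# Eager: each jnp call is dispatched as soon as Python reaches it.
eager_out = layer(w, h)

# Captured: make_jaxpr traces the function with abstract inputs and
# returns the operation graph instead of a numerical result.
graph = jax.make_jaxpr(layer)(w, h)
print(graph)

# Compiled: jit traces the function, hands the graph to XLA, and
# runs the resulting executable.
jit_out = jax.jit(layer)(w, h)
```

The printed graph lists the primitive operations in order, which is the form a compiler can optimise as a whole.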

XLA (Accelerated Linear Algebra) follows a three-stage pipeline:

1. Framework lowering. JAX, TensorFlow, or PyTorch/XLA converts Python ops into StableHLO, a portable intermediate representation (IR) with a stable opset. StableHLO is then lowered further into XLA's internal HLO (High-Level Operations) dialect.
2. Target-independent optimisation. XLA runs algebraic simplification, common-subexpression elimination (CSE), and - crucially - operation fusion. Two element-wise ops that would each read and write a full tensor become one fused op with a single memory round-trip.
3. Backend code generation. The GPU backend generates LLVM IR and emits PTX through LLVM's NVPTX backend, which NVIDIA's toolchain then assembles into SASS; the TPU backend produces TPU machine code. Each backend can apply further target-specific fusion decisions before final emission.
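JAX's ahead-of-time API exposes the stages of this pipeline one at a time. The following is a minimal sketch, assuming a recent JAX release where `lower()` produces StableHLO text; the function `f` is a placeholder chosen for illustration.

```python
import jax
import jax.numpy as jnp


def f(x):
    # Placeholder computation: three element-wise ops in a row.
    return jnp.exp(x) * 2.0 + 1.0


x = jnp.ones((8,), dtype=jnp.float32)

# Stage 1: framework lowering. The result is an MLIR module
# (StableHLO in recent JAX releases).
lowered = jax.jit(f).lower(x)
stablehlo_text = lowered.as_text()

# Stages 2 and 3: XLA optimises the graph and generates code for
# whichever backend JAX is running on.
compiled = lowered.compile()
hlo_text = compiled.as_text()  # optimised HLO, after XLA's passes

# The compiled object is callable like the original function.
result = compiled(x)

print(stablehlo_text)
print(hlo_text)
```

Comparing the two printouts shows what the optimiser changed; the exact text varies by backend and JAX version.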

The result is a compiled artefact bound to the exact tensor shapes seen during tracing. New shapes require re-compilation, which is why shape dynamism is one of XLA's most persistent pain points.
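The shape binding is easy to observe. In the sketch below, a Python-side counter (an illustrative device, not part of any API) increments only while JAX traces the function, so it counts how many times a new executable had to be built.

```python
import jax
import jax.numpy as jnp

trace_count = 0  # illustrative counter, incremented only during tracing


@jax.jit
def scale(x):
    global trace_count
    trace_count += 1  # Python side effect: runs at trace time, not at execution time
    return x * 2.0


scale(jnp.ones((4,)))  # first call: trace and compile for shape (4,)
scale(jnp.ones((4,)))  # same shape and dtype: cached executable is reused
scale(jnp.ones((8,)))  # new shape: trace and compile again

print(trace_count)  # 2
```

Padding inputs to a small set of fixed shapes is a common way to keep the number of compilations bounded.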

HLO fusion: the mechanism behind the speedup

Fusion is not a vague "it batches things together" idea. It has a precise meaning: two HLO operations are fused when their combined kernel reads inputs once, computes both ops in registers, and writes outputs once.

Consider GELU applied after a linear projection:

# unfused - two kernel launches, two HBM round-trips
x = matmul(W, h)          # write x to HBM
y = gelu(x)               # read x from HBM, write y to HBM

# fused - one kernel launch, one HBM round-trip
y = fused_matmul_gelu(W, h)
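In practice nobody writes `fused_matmul_gelu` by hand; the compiler decides. A rough JAX equivalent of the pseudocode, with shapes and the name `projection_gelu` invented for illustration, looks like this. How the two ops are grouped into kernels depends on the backend: on GPUs the matmul is often dispatched to a vendor library, with the GELU either attached as an epilogue or emitted as its own fused element-wise kernel.

```python
import jax
import jax.numpy as jnp


def projection_gelu(w, h):
    x = w @ h              # linear projection
    return jax.nn.gelu(x)  # element-wise activation


# Illustrative shapes only.
key_w, key_h = jax.random.split(jax.random.PRNGKey(0))
w = jax.random.normal(key_w, (16, 8))
h = jax.random.normal(key_h, (8, 4))

# Eager: the ops are dispatched one after another.
y_eager = projection_gelu(w, h)

# Jitted: XLA sees both ops at once and is free to fuse them.
y_jit = jax.jit(projection_gelu)(w, h)

print(jnp.max(jnp.abs(y_eager - y_jit)))
```

Both paths compute the same values up to floating-point rounding; only the kernel schedule differs.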

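For a 4096-wide activation on an H100, the cost of each HBM round-trip scales with the size of the tensor. A bf16 tensor of 256 rows by 4096 columns occupies 2 MiB, so writing it and reading it back moves 4 MiB, which takes roughly 1-2 microseconds at peak bandwidth (3.35 TB/s), before any kernel-launch overhead. Multiply that across the many element-wise ops in each transformer layer, dozens of layers, and millions of training steps, and the savings compound.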
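The arithmetic behind that estimate is short enough to write out. The row count (256) and element size (2 bytes, i.e. bf16) below are assumptions picked for illustration; only the 4096 width and the 3.35 TB/s bandwidth come from the text.

```python
HBM_BANDWIDTH_BYTES_PER_S = 3.35e12  # peak H100 HBM bandwidth quoted in the text


def round_trip_seconds(rows, cols, bytes_per_element):
    """Bandwidth-limited time to write a tensor to HBM and read it back once."""
    tensor_bytes = rows * cols * bytes_per_element
    return 2 * tensor_bytes / HBM_BANDWIDTH_BYTES_PER_S


# Assumed shape: 256 rows of a 4096-wide activation in bf16 (2 bytes each).
t = round_trip_seconds(rows=256, cols=4096, bytes_per_element=2)
print(f"{t * 1e6:.2f} microseconds")  # 1.25 microseconds
```

This is a lower bound: real kernels rarely reach peak bandwidth, and launch overhead comes on top.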

XLA's fusion heuristic groups producers and consumers greedily, subject to register pressure limits. The compiler also performs layout assignment - choosing row-major vs. column-major for each tensor - to avoid transpose overhead when feeding into cuBLAS.
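One way to see what the fusion pass decided is to read the optimised HLO and look for fusion instructions. This is a rough sketch: the function is a made-up element-wise chain, and the number and naming of fusion instructions vary by backend and XLA version, so the code only reports what it finds.

```python
import jax
import jax.numpy as jnp


def elementwise_chain(x):
    # Producer/consumer chain of element-wise ops: a natural fusion candidate.
    return jnp.tanh(x) * 2.0 + 1.0


x = jnp.linspace(-1.0, 1.0, 1024)

# Optimised HLO text for the current backend, after XLA's passes have run.
hlo_text = jax.jit(elementwise_chain).lower(x).compile().as_text()

# Collect the lines that mention a fusion; the count is backend-dependent.
fusion_lines = [line for line in hlo_text.splitlines() if "fusion" in line]
print(f"{len(fusion_lines)} line(s) mention fusion")
for line in fusion_lines:
    print(line.strip())
```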
