XLA and Just-In-Time Compilation

XLA compiles a whole computation graph into fused, hardware-specific kernels at runtime, trading a one-time compilation cost for sustained throughput gains across GPUs and TPUs.

A naive eager-mode PyTorch training step launches hundreds of separate CUDA kernels per iteration. Each launch carries overhead, and between kernels intermediate tensors are written to HBM and read back, so the GPU spends much of its time waiting on memory. XLA's central bet is that a compiler that sees the whole computation graph before executing any of it can collapse those hundreds of round-trips into a handful of fused kernels that keep intermediates in registers or on-chip memory instead of HBM. That bet turned out to be correct enough to power Google's entire TPU stack.

What "compiling a graph" actually means

Every ML framework, at some level, represents a forward pass as a directed acyclic graph of operations (matmul, softmax, layer norm, ...). An eager runtime executes each node immediately as Python reaches it. A JIT compiler instead traces or captures that graph, hands it to an optimiser, and emits hardware-specific code for the whole thing.
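To make the trace-then-compile idea concrete, here is a minimal sketch in plain Python. It is a toy, not how XLA is implemented: the names `Tracer`, `trace`, `compile_graph`, and `step` are all hypothetical, it supports only `+` and `*` with the traced value on the left, and "fusion" here just means folding every recorded node into one Python expression so that no intermediate is materialised as a separate step.

```python
# Toy illustration only: a tiny tracer, not XLA's real machinery.
# All names here (Tracer, trace, compile_graph, step) are hypothetical.


class Tracer:
    """Stands in for a tensor and records operations instead of running them."""

    def __init__(self, graph, name):
        self.graph = graph  # shared list of (output, op, lhs, rhs) nodes
        self.name = name

    def _record(self, op, other):
        out = Tracer(self.graph, f"t{len(self.graph)}")
        other_name = other.name if isinstance(other, Tracer) else repr(other)
        self.graph.append((out.name, op, self.name, other_name))
        return out

    def __add__(self, other):
        return self._record("+", other)

    def __mul__(self, other):
        return self._record("*", other)


def trace(fn, arg_names):
    """Run fn once on Tracer placeholders to capture its graph."""
    graph = []
    args = [Tracer(graph, name) for name in arg_names]
    result = fn(*args)
    return graph, result.name


def compile_graph(graph, arg_names, out_name):
    """Fold every node into a single expression and build one callable from it."""
    exprs = {name: name for name in arg_names}
    for out, op, lhs, rhs in graph:
        # Constants are not in exprs, so they fall through unchanged.
        exprs[out] = f"({exprs.get(lhs, lhs)} {op} {exprs.get(rhs, rhs)})"
    source = f"lambda {', '.join(arg_names)}: {exprs[out_name]}"
    return source, eval(source)


def step(x, w, b):
    # In eager execution this is three separate operations.
    return (x * w + b) * 2.0


arg_names = ["x", "w", "b"]
graph, out_name = trace(step, arg_names)
source, fused = compile_graph(graph, arg_names, out_name)

print(graph)   # three recorded nodes: t0, t1, t2
print(source)  # lambda x, w, b: (((x * w) + b) * 2.0)
print(fused(3.0, 4.0, 1.0), step(3.0, 4.0, 1.0))  # 26.0 26.0
```

Calling `step` directly runs each operation as Python reaches it, which is the eager path. Calling `fused` runs the single expression produced from the captured graph. A real compiler would apply optimisation passes to the graph and emit device code at that point, rather than a Python lambda.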
