torch.compile and TorchInductor

PyTorch 2.0 shipped a single function that, in the PyTorch team's benchmarks across 163 open-source models, sped up training on an NVIDIA A100 by an average of 21% at float32 and 51% under AMP, with no changes to the model code. That function is torch.compile. Understanding why it works requires following the graph from Python bytecode down to machine code.

The compilation pipeline in four layers

torch.compile is not one tool but a stack of four cooperating components:

Python source
    │
    ▼
TorchDynamo      (trace and capture the graph)
    │
    ▼
AOT Autograd     (capture the backward pass ahead-of-time)
    │
    ▼
Compiler backend (default: TorchInductor)
    │
    ▼
Triton / C++     (generated kernel code)

TorchDynamo sits at the CPython level. It hooks into the Frame Evaluation API (PEP 523), intercepts bytecode at the point of function entry, and symbolically executes operations to record a torch.fx graph. The key design choice: Dynamo does not require the user to write tracing-friendly code. If it encounters something it cannot capture (a C extension call, data-dependent branching, a print), it falls back gracefully and inserts a "graph break": the graph captured so far is compiled, the unsupported fragment runs in eager mode, and capture resumes afterwards.
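To watch a graph break happen, one option is to pass torch.compile a custom backend that only records the graphs Dynamo hands it. The sketch below assumes a recent PyTorch 2.x install; `recording_backend` and `toy` are hypothetical names made up for illustration, and the exact number and contents of the captured graphs can vary between PyTorch versions.

```python
import torch

captured_graphs = []  # every torch.fx.GraphModule that Dynamo hands to the backend


def recording_backend(gm: torch.fx.GraphModule, example_inputs):
    """Hypothetical debugging backend: record the graph, then run it unoptimised."""
    captured_graphs.append(gm)
    return gm.forward  # a backend must return a callable


def toy(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:  # data-dependent branch: the result is unknown at trace time
        b = b * -1
    return x * b


compiled_toy = torch.compile(toy, backend=recording_backend)

a, b = torch.randn(8), torch.ones(8)
out = compiled_toy(a, b)

print(f"graphs captured: {len(captured_graphs)}")
for i, gm in enumerate(captured_graphs):
    ops = [node.target for node in gm.graph.nodes if node.op == "call_function"]
    print(f"graph {i}: {ops}")
```

The `if` on `b.sum()` depends on tensor data, so Dynamo cannot fold it into a single graph: it compiles what it has seen so far, lets Python evaluate the condition, and starts a new graph for the code after the branch.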

AOT Autograd then captures both the forward and backward passes ahead of time, before any data flows through. This means the backward graph is also available to the compiler, enabling cross-pass fusion that is impossible in classic eager mode.
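The forward and backward graphs can be inspected directly. The sketch below assumes the `functorch.compile.aot_function` entry point bundled with PyTorch 2.x is available in your install (it is a lower-level interface than torch.compile and may change between releases); `make_compiler` and `f` are hypothetical names used only for illustration.

```python
import torch
from functorch.compile import aot_function  # assumption: exposed by your PyTorch 2.x build

captured = {}  # maps "forward" / "backward" to the traced torch.fx.GraphModule


def make_compiler(name):
    """Build a pass-through compiler that records the graph it receives."""
    def compiler(gm: torch.fx.GraphModule, example_inputs):
        captured[name] = gm
        return gm  # a GraphModule is itself callable, so no real compilation happens
    return compiler


def f(x, w):
    return torch.tanh(x @ w).sum()


x = torch.randn(4, 8)
w = torch.randn(8, 3, requires_grad=True)

aot_f = aot_function(
    f,
    fw_compiler=make_compiler("forward"),
    bw_compiler=make_compiler("backward"),
)

loss = aot_f(x, w)
loss.backward()  # by this point both graphs have been handed to the compilers

print(captured["forward"].code)
print(captured["backward"].code)
```

The backward graph arrives as an ordinary torch.fx graph of tensor operations, which is what lets a backend such as TorchInductor optimise it with the same machinery it applies to the forward pass.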

TorchInductor is the default backend. It lowers the torch.fx graph to a loop-level intermediate representation (loop IR), then emits either Triton kernels for CUDA/ROCm/Intel GPUs or vectorised C++ for CPU. The critical optimisation at this stage is operator fusion: instead of writing activations back to DRAM after every elementwise op, Inductor recognises fuseable chains and merges them into a single Triton kernel.

Stage	What it does	Output
TorchDynamo	Bytecode interception, graph capture	`torch.fx.Graph`
AOT Autograd	Ahead-of-time backward tracing	joint fwd+bwd graph
TorchInductor	Loop IR lowering + fusion	Triton / C++ source
Triton compiler / host C++ compiler	Hardware codegen	PTX / cubin (GPU), native object code (CPU)

Why fusion matters so much

A naïve PyTorch forward pass through a transformer block calls dozens of separate kernels. Each one:
1. Reads its input tensors from DRAM.
2. Computes.
3. Writes its output back to DRAM.

For elementwise chains (LayerNorm -> dropout -> residual add), the compute is trivially cheap relative to the memory round trips. The roofline model makes this concrete: an A100 offers ~312 TFLOP/s of dense FP16/BF16 tensor-core compute but only ~2 TB/s of memory bandwidth. An unfused chain of four operations, each moving, say, 1 GB through DRAM, spends 4 x (1 GB / 2 TB/s) = 2 ms on memory traffic alone, even if the arithmetic finishes in microseconds.
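The estimate above is easy to reproduce. The sketch below redoes the arithmetic and adds a fused comparison, under the assumption (not stated in the text) that a fused kernel moves the same 1 GB through DRAM only once; `dram_time_ms` is a hypothetical helper.

```python
def dram_time_ms(bytes_moved: float, bandwidth_bytes_per_s: float) -> float:
    """Time spent purely on DRAM traffic, in milliseconds."""
    return bytes_moved / bandwidth_bytes_per_s * 1e3


BANDWIDTH = 2e12  # ~2 TB/s, the A100 figure used in the text
GB = 1e9

# Unfused: four kernels, each moving 1 GB through DRAM
unfused_ms = 4 * dram_time_ms(1 * GB, BANDWIDTH)

# Fused (assumption): one kernel moves 1 GB in total, and the
# intermediates stay in registers instead of round-tripping to DRAM
fused_ms = dram_time_ms(1 * GB, BANDWIDTH)

print(f"unfused: {unfused_ms:.1f} ms, fused: {fused_ms:.1f} ms")
```

Under these assumptions fusion cuts memory time by the length of the chain, which is why longer elementwise chains benefit the most.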
