Operator Lowering and IRs

A PyTorch matmul call carries no information about cache lines, warp occupancy, or register pressure. The CUDA kernel that eventually runs on the GPU does. Something has to bridge that gap, and that something is a lowering pipeline: a chain of intermediate representations (IRs) where each step either optimises at one abstraction level or discards abstraction to expose the level below. Getting this chain right is a large part of why torch.compile can deliver throughput gains, sometimes in the 2-3x range depending on model and hardware, without any rewrite of the model code itself.

What an IR actually is

An IR is a data structure that represents a computation in a form that is simultaneously easy to analyse and easy to transform. Unlike source code, an IR is designed for machines to read. Unlike binary instructions, it retains enough structure to permit rewrites that would be impossible or illegal at a lower level.
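To make "easy to analyse and easy to transform" concrete, here is a minimal sketch of a toy graph IR in Python. The Node class, simplify, and count_ops are hypothetical names invented for this illustration and do not correspond to PyTorch FX, jaxpr, or any real compiler API. The rewrite also assumes x > 0; a real compiler would have to prove that, or accept a change in NaN behaviour, before applying it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """One node of a toy expression graph: an op name plus its input nodes."""
    op: str
    inputs: Tuple["Node", ...] = ()
    name: str = ""  # only meaningful for "input" leaves


def count_ops(node: Node) -> int:
    """Analysis: count the non-input operations reachable from `node`."""
    own = 0 if node.op == "input" else 1
    return own + sum(count_ops(i) for i in node.inputs)


def simplify(node: Node) -> Node:
    """Transformation: rewrite exp(log(x)) to x, working bottom-up.

    Assumption: x > 0, so the rewrite does not change the result.
    """
    rebuilt = Node(node.op, tuple(simplify(i) for i in node.inputs), node.name)
    if rebuilt.op == "exp" and rebuilt.inputs[0].op == "log":
        return rebuilt.inputs[0].inputs[0]
    return rebuilt


# Build the graph for exp(log(x)) + x, then simplify it to x + x.
x = Node("input", name="x")
expr = Node("add", (Node("exp", (Node("log", (x,)),)), x))
simplified = simplify(expr)
```

Both the analysis and the rewrite are a few lines because the structure of the computation is explicit in the data. The same pattern is far harder to recognise once it has been expanded into loads, stores, and scalar arithmetic.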

A useful mental model: each IR level answers a different question.

IR level	Representative form	Question it answers
Graph IR	PyTorch FX graph, JAX jaxpr, XLA HLO	What computations depend on each other?
Affine / loop IR	MLIR Affine dialect, Halide schedules	How are loops structured and bounded?
Memory IR	MLIR MemRef dialect, LLVM IR	Where do operands live; what are the access patterns?
Target IR	PTX, AMDGPU ISA	Which hardware instructions execute?

Lowering moves a program from the top row toward the bottom. At each step the compiler can apply passes that are only sound at that abstraction level. Loop tiling is meaningless in a graph IR; algebraic simplification of exp(log(x)) is hard to spot in PTX.
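Loop tiling only becomes expressible once the loops themselves are visible. The sketch below writes a matrix multiply as an explicit loop nest and then tiles it by hand, which is roughly what a loop-level pass does automatically. It uses plain Python lists for illustration; the function names and the tile size of 2 are arbitrary choices, and Python itself gains nothing from the transformation.

```python
def matmul_naive(A, B):
    """Reference loop nest: C[i][j] += A[i][p] * B[p][j]."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same computation, with each loop split into an outer loop over
    tiles and an inner loop within a tile. min() handles ragged edges."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for p in range(p0, min(p0 + tile, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 1], [2, 3]]
assert matmul_tiled(A, B) == matmul_naive(A, B)
```

The tiled version performs the same multiply-adds on each output element; only the iteration order changes. On real hardware that reordering is what keeps a block of operands resident in cache or shared memory while it is reused.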

The MLIR dialect stack

MLIR (Multi-Level Intermediate Representation) formalises this idea by making dialect switching explicit. Every operation belongs to a named dialect; a lowering pass converts operations from one dialect into operations in another. The Toy tutorial in the MLIR documentation demonstrates this concretely: a high-level toy.transpose is first canonicalised within the Toy dialect (chapter 3), then partially lowered to Affine and MemRef operations for loop-level optimisation (chapter 5), and finally fully lowered to the LLVM dialect, from which LLVM IR is exported for code generation (chapter 6).

A simplified trace of a matrix multiply through the stack looks like this:

linalg.matmul ins(%A, %B) outs(%C)   # structured op: knows it's a matmul
  -> affine.for loops with affine.load/store  # explicit loops, affine bounds
    -> memref.load/store + arith.mulf         # concrete memory, scalar arith
      -> llvm.load + llvm.fmul                # LLVM, ready for NVPTX backend
        -> PTX fma.rn.f32                     # hardware instruction

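Each arrow is one or more MLIR conversion passes. The conversion framework checks every operation against a declared conversion target: in a full conversion, any operation that is not legal for the target and has no applicable lowering pattern causes the pass to fail. You therefore cannot accidentally leave a linalg.matmul dangling in a module that is supposed to contain only LLVM dialect operations - the compiler will error at the conversion stage, not silently miscompile.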
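The legality idea can be sketched in a few lines of Python. This is a toy model, not MLIR's actual API: operations are reduced to "dialect.name" strings, PATTERNS and full_conversion are hypothetical names, and loop operations are left out for brevity. Replacement ops go back on the worklist, so a single call lowers through several dialects in turn.

```python
from typing import Dict, List, Set

# Hypothetical lowering patterns: op name -> ops that replace it.
PATTERNS: Dict[str, List[str]] = {
    "linalg.matmul": ["affine.load", "affine.load",
                      "arith.mulf", "arith.addf", "affine.store"],
    "affine.load": ["memref.load"],
    "affine.store": ["memref.store"],
    "memref.load": ["llvm.load"],
    "memref.store": ["llvm.store"],
    "arith.mulf": ["llvm.fmul"],
    "arith.addf": ["llvm.fadd"],
}


def dialect_of(op: str) -> str:
    """Return the dialect prefix of an op such as 'linalg.matmul'."""
    return op.split(".", 1)[0]


def full_conversion(ops: List[str],
                    patterns: Dict[str, List[str]],
                    legal_dialects: Set[str]) -> List[str]:
    """Lower `ops` until every op is in a legal dialect, or raise.

    Toy assumption: the patterns contain no cycles.
    """
    worklist = list(ops)
    result: List[str] = []
    while worklist:
        op = worklist.pop(0)
        if dialect_of(op) in legal_dialects:
            result.append(op)
        elif op in patterns:
            # Replacements are re-checked, so they may be lowered again.
            worklist = patterns[op] + worklist
        else:
            raise ValueError(f"failed to legalize operation '{op}'")
    return result


lowered = full_conversion(["linalg.matmul"], PATTERNS, {"llvm"})
```

If the linalg.matmul entry is removed from the pattern set, the same call raises ValueError instead of returning a module with the op still in it. That models the "error rather than miscompile" behaviour described above.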
