Graph Capture and CUDA Graphs

Every time PyTorch dispatches a kernel to the GPU, it pays a tax: Python overhead, C++ dispatch, and a CUDA API call that wakes the driver. For a transformer forward pass with hundreds of small kernels, this CPU-side chatter can cost more wall-clock time than the actual arithmetic. On an A100 running BERT at batch size 1, CPU overhead can account for 30-50% of total latency. CUDA Graphs are the mechanism that cuts that tax to near zero.

What the CUDA Graph Model Actually Is

A CUDA Graph is a directed acyclic graph (DAG) where each node represents a GPU operation: a kernel launch, a memory copy, a memory set, or a host-function call. Edges encode dependencies. Instead of submitting nodes one at a time through the CUDA stream, you submit the entire DAG in a single cudaGraphLaunch() call.

The workflow has three phases:

Capture. Wrap your workload between cudaStreamBeginCapture() and cudaStreamEndCapture(). During capture the driver records the work submitted to the capturing stream instead of executing it. You end up with a cudaGraph_t object representing the DAG.
Instantiation. Call cudaGraphInstantiate() to compile the DAG into an executable form (cudaGraphExec_t). This is a one-time cost, typically a few hundred microseconds.
Replay. Call cudaGraphLaunch(exec, stream) as many times as you like. Each call launches the full DAG with one driver interaction instead of hundreds.
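
The three phases can be modeled in plain Python to make the bookkeeping concrete. This is a toy sketch, not the CUDA API: ToyStream, instantiate, and the submissions counter are hypothetical names, and the "graph" is a simple linear chain of operations rather than a general DAG.

```python
from typing import Callable, List, Optional, Tuple

Op = Callable[[], None]


class ToyStream:
    """Toy stand-in for a CUDA stream; `submissions` counts driver round trips."""

    def __init__(self) -> None:
        self.submissions = 0
        self._recording: Optional[List[Op]] = None

    def begin_capture(self) -> None:
        self._recording = []

    def launch(self, op: Op) -> None:
        if self._recording is not None:
            self._recording.append(op)  # capture: record only, do not run
            return
        self.submissions += 1  # eager: one submission per operation
        op()

    def end_capture(self) -> List[Op]:
        if self._recording is None:
            raise RuntimeError("end_capture() called without begin_capture()")
        graph, self._recording = self._recording, None
        return graph

    def launch_graph(self, graph_exec: Tuple[Op, ...]) -> None:
        self.submissions += 1  # replay: one submission for the whole graph
        for op in graph_exec:
            op()


def instantiate(graph: List[Op]) -> Tuple[Op, ...]:
    """Freeze the recorded operations into an immutable, replayable form."""
    return tuple(graph)


buffer = [0]


def add_one() -> None:
    buffer[0] += 1


stream = ToyStream()

# Phase 1: capture. 300 "kernels" are recorded; none of them runs.
stream.begin_capture()
for _ in range(300):
    stream.launch(add_one)
graph = stream.end_capture()
value_after_capture = buffer[0]

# Phase 2: instantiate once.
graph_exec = instantiate(graph)

# Phase 3: replay. All 300 operations run behind a single submission.
stream.launch_graph(graph_exec)
value_after_replay = buffer[0]
submissions_for_replay = stream.submissions
```

In this model the buffer is untouched after capture and only changes on replay, which mirrors why a captured region produces no results until the graph is launched.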

The GPU still executes the same kernels in the same order. What changes is the cost of telling it to. NVIDIA's own measurements put individual kernel launch overhead at roughly 2-4 microseconds; for a 300-kernel forward pass this adds up to roughly 0.6-1.2 ms of pure overhead per iteration, nearly all of which CUDA Graphs remove.
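
The overhead estimate is simple multiplication, and a few lines of Python make it easy to redo for a different kernel count. The function name launch_overhead_ms is hypothetical; the 2 and 4 microsecond figures are the per-launch bounds quoted above, not new measurements.

```python
def launch_overhead_ms(num_kernels: int, per_launch_us: float) -> float:
    """Total CPU-side launch overhead in milliseconds for one iteration."""
    return num_kernels * per_launch_us / 1000.0


# Bounds for a 300-kernel forward pass at 2 and 4 microseconds per launch.
low_ms = launch_overhead_ms(300, 2.0)
high_ms = launch_overhead_ms(300, 4.0)
print(f"{low_ms:.1f} ms to {high_ms:.1f} ms of launch overhead per iteration")
```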

PyTorch's Graph API

PyTorch exposes CUDA Graphs through three interfaces:

torch.cuda.CUDAGraph (low-level):

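import torch

# Assumes model, x, and new_input already live on the GPU; x is the static input buffer
g = torch.cuda.CUDAGraph()

# Warmup: run a few iterations on a side stream so one-time lazy
# initialization happens before capture rather than during it
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = model(x)
torch.cuda.current_stream().wait_stream(s)

# Capture: the kernels are recorded into g, not executed
with torch.cuda.graph(g):
    y = model(x)

# Replay (x and y are now fixed memory addresses)
x.copy_(new_input)  # write new data into the captured input buffer
g.replay()
result = y.clone()  # copy out before the next replay overwrites y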

The critical point is that x and y are static buffers. The graph captures the tensor addresses, not their contents. To feed new inputs you copy data into the same allocation; to read output you copy from the same allocation. This single constraint drives most of the practical complexity.
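
One way to keep the static-buffer rule from leaking into calling code is to hide it behind a small wrapper. The sketch below is one possible packaging, not a PyTorch API: StaticBufferRunner is a hypothetical name, it assumes inference only (hence torch.no_grad()), a single input tensor of fixed shape, and a model already placed on the same device as the example input. Without a CUDA device it falls back to eager execution so the sketch still runs.

```python
import torch


class StaticBufferRunner:
    """Hypothetical helper: copy in, replay, copy out around one captured graph."""

    def __init__(self, model: torch.nn.Module, example_input: torch.Tensor) -> None:
        self.model = model
        self.static_in = example_input.clone()  # fixed address read by the graph
        self.static_out = None                  # fixed address written by the graph
        self.graph = None
        if torch.cuda.is_available() and example_input.is_cuda:
            self._capture()

    def _capture(self) -> None:
        # Warm up on a side stream before capturing.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side), torch.no_grad():
            for _ in range(3):
                self.model(self.static_in)
        torch.cuda.current_stream().wait_stream(side)

        self.graph = torch.cuda.CUDAGraph()
        with torch.no_grad(), torch.cuda.graph(self.graph):
            self.static_out = self.model(self.static_in)

    def __call__(self, new_input: torch.Tensor) -> torch.Tensor:
        if new_input.shape != self.static_in.shape:
            raise ValueError("input shape must match the captured shape")
        if self.graph is None:
            # Eager fallback (assumption: used when no graph was captured).
            with torch.no_grad():
                return self.model(new_input)
        self.static_in.copy_(new_input)  # same allocation, new contents
        self.graph.replay()
        return self.static_out.clone()   # detach the result from the static buffer


device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 4).to(device).eval()
runner = StaticBufferRunner(model, torch.randn(2, 8, device=device))
result = runner(torch.randn(2, 8, device=device))
```

The shape check is the visible cost of the design: a captured graph is tied to the shapes it was recorded with, so a different batch size needs a separate capture.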

torch.cuda.make_graphed_callables (module-level):

Wraps individual nn.Module instances or functions, given sample inputs, and returns graph-accelerated versions, handling the static-buffer bookkeeping automatically. This is safer for models where only part of the computation is graph-safe.
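
A minimal sketch of graphing a single submodule follows. The helper name graph_block and the CPU fallback are assumptions added for illustration; torch.cuda.make_graphed_callables itself takes the callable and a tuple of sample tensors, and later inputs are expected to match those samples in shape and requires_grad state.

```python
import torch


def graph_block(block: torch.nn.Module, sample: torch.Tensor) -> torch.nn.Module:
    """Return a CUDA-graphed version of `block`, or `block` unchanged without CUDA."""
    if not torch.cuda.is_available():
        return block  # fallback (assumption) so the sketch runs on CPU-only machines
    block = block.cuda()
    sample = sample.cuda()
    # Warmup, capture, and static buffers are handled inside this call.
    return torch.cuda.make_graphed_callables(block, (sample,))


block = torch.nn.Linear(4, 2)
graphed = graph_block(block, torch.randn(3, 4))

# Inputs must live on the same device as the graphed module.
run_device = next(graphed.parameters()).device
out = graphed(torch.randn(3, 4, device=run_device))
```

The rest of the model can keep running eagerly around the graphed block, which is the point of graphing only the part that is known to be graph-safe.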

torch.compile with mode="reduce-overhead":

The highest-level option. torch.compile invokes TorchDynamo to trace the graph and TorchInductor to generate kernels, then wraps captured regions in CUDA Graphs automatically. The reduce-overhead mode is the intended path for most users since PyTorch 2.0.
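
Opting in takes one call. In the sketch below the helper names compile_for_low_latency and run_once are hypothetical; the torch.compile call and its mode argument are the real API. Compilation is lazy, so the cost is paid on the first invocation of the compiled module, not when torch.compile returns.

```python
import torch


def compile_for_low_latency(model: torch.nn.Module):
    """Wrap `model` so compiled regions can be replayed through CUDA Graphs."""
    return torch.compile(model, mode="reduce-overhead")


def run_once(compiled, x: torch.Tensor) -> torch.Tensor:
    """Run one inference step and copy the result out of graph-owned memory."""
    with torch.no_grad():
        # clone() is a precaution: outputs of graphed regions may be
        # overwritten by the next invocation.
        return compiled(x).clone()


compiled = compile_for_low_latency(torch.nn.Linear(4, 2))
```

Calling run_once(compiled, x) with a tensor on the model's device triggers compilation the first time and replays on later calls with the same input shapes.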
