Graph Capture and CUDA Graphs

CUDA Graphs record a sequence of GPU operations as a reusable execution graph, eliminating per-kernel CPU launch overhead and enabling significant throughput gains for workloads with static shapes and control flow.

Every time PyTorch dispatches a kernel to the GPU, it pays a tax: Python overhead, C++ dispatch, and a CUDA API call that wakes the driver. For a transformer forward pass with hundreds of small kernels, this CPU-side chatter can cost more wall-clock time than the actual arithmetic. On an A100 running BERT at batch size 1, CPU overhead can account for 30-50% of total latency. CUDA Graphs are the mechanism that cuts that tax to near zero.
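To see why small kernels are the problem case, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which the CPU issues launches back to back and the GPU runs each kernel as soon as it arrives, so each step costs whichever is longer: the launch or the kernel. The function name and the microsecond figures are hypothetical, chosen only for illustration, not measurements.

```python
def gpu_idle_fraction(launch_us: float, kernel_us: float) -> float:
    """Fraction of wall-clock time the GPU sits idle waiting for the next launch.

    Simplified model (an assumption): launches and kernels are pipelined, so
    each step takes max(launch_us, kernel_us) and the GPU is busy for kernel_us.
    """
    step_us = max(launch_us, kernel_us)
    return (step_us - kernel_us) / step_us


# Hypothetical numbers: a small kernel that finishes faster than the CPU can launch it.
launch_bound = gpu_idle_fraction(launch_us=8.0, kernel_us=5.0)

# Hypothetical numbers: a large kernel that hides the launch cost entirely.
compute_bound = gpu_idle_fraction(launch_us=8.0, kernel_us=50.0)

print(f"launch-bound idle fraction:  {launch_bound:.1%}")
print(f"compute-bound idle fraction: {compute_bound:.1%}")
```

Under this model the overhead only shows up when kernels are shorter than the time it takes to launch them, which is the situation a small-batch transformer pass tends to be in.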

What the CUDA Graph Model Actually Is

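A CUDA Graph is a directed acyclic graph (DAG) in which each node represents a GPU-side operation such as a kernel launch, a memory copy, a memset, or a host-function call. Edges encode dependencies between nodes. Instead of issuing operations one at a time into a CUDA stream, you define or capture the DAG once, instantiate it into an executable graph with cudaGraphInstantiate(), and then submit the whole thing with a single cudaGraphLaunch() call, as many times as needed.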
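The DAG idea can be sketched in plain Python without a GPU. The toy below is not the CUDA API: ToyGraph, add_node, instantiate, and launch are hypothetical names, and the nodes are ordinary Python callables standing in for kernels and copies. It only illustrates the shape of the model: dependencies are declared once, an execution order is resolved once (loosely mirroring the instantiate step in CUDA), and the host then triggers the whole graph with a single call.

```python
from graphlib import TopologicalSorter


class ToyGraph:
    """Toy stand-in for a CUDA graph: nodes are callables, edges are dependencies."""

    def __init__(self):
        self._ops = {}    # node name -> callable taking the shared state dict
        self._deps = {}   # node name -> set of node names it depends on
        self._order = ()  # execution order, filled in by instantiate()

    def add_node(self, name, op, deps=()):
        self._ops[name] = op
        self._deps[name] = set(deps)

    def instantiate(self):
        # Resolve a dependency-respecting order once, up front.
        # TopologicalSorter raises CycleError if the graph is not acyclic.
        self._order = tuple(TopologicalSorter(self._deps).static_order())

    def launch(self, state):
        # A single host-side call runs every node in dependency order.
        for name in self._order:
            self._ops[name](state)
        return state


g = ToyGraph()
g.add_node("copy_in", lambda s: s.update(x=list(s["host_in"])))
g.add_node("scale", lambda s: s.update(y=[2 * v for v in s["x"]]), deps=["copy_in"])
g.add_node("bias", lambda s: s.update(z=[v + 1 for v in s["y"]]), deps=["scale"])
g.add_node("copy_out", lambda s: s.update(host_out=list(s["z"])), deps=["bias"])
g.instantiate()

result = g.launch({"host_in": [1, 2, 3]})
print(result["host_out"])
```

The same instantiated graph can be launched again with different input, for example g.launch({"host_in": [0]}), without redeclaring nodes or recomputing the order. That reuse is the property the real mechanism relies on: the structure is fixed, only the data changes.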
