Graph Capture and CUDA Graphs
CUDA Graphs record a sequence of GPU operations as a reusable execution graph, eliminating per-kernel CPU launch overhead and enabling significant throughput gains for workloads with static shapes and control flow.
Every time PyTorch dispatches a kernel to the GPU, it pays a tax: Python overhead, C++ dispatch, and a CUDA API call that wakes the driver. For a transformer forward pass with hundreds of small kernels, this CPU-side chatter can cost more wall-clock time than the actual arithmetic. On an A100 running BERT at batch size 1, CPU overhead can account for 30-50% of total latency. CUDA Graphs are the mechanism that cuts that tax to near zero.
What the CUDA Graph Model Actually Is
A CUDA Graph is a directed acyclic graph (DAG) in which each node represents a GPU operation: a kernel launch, a memory copy, a memset, or a host-function call. Edges encode the dependencies between those operations. Instead of submitting operations one at a time through a CUDA stream, you instantiate the graph once and then submit the entire DAG with a single cudaGraphLaunch() call.
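To make the DAG model concrete, here is a minimal pure-Python sketch. It is a toy, not the CUDA API: `ToyGraph`, `ToyDevice`, and the node names are hypothetical, and no GPU is involved. It only illustrates the two ideas from the paragraph above: dependencies are resolved once at instantiation, and a replay costs one host-side submission instead of one per operation.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class ToyGraph:
    """Toy stand-in for a CUDA graph: nodes are operations, edges are dependencies."""

    # Maps each node name to the set of node names it depends on.
    deps: dict = field(default_factory=dict)

    def add_node(self, name, depends_on=()):
        self.deps[name] = set(depends_on)

    def instantiate(self):
        # Resolve a valid execution order once, up front.
        # TopologicalSorter raises CycleError if the graph is not a DAG.
        return tuple(TopologicalSorter(self.deps).static_order())


class ToyDevice:
    """Counts host-side submissions (hypothetical; stands in for the driver)."""

    def __init__(self):
        self.submissions = 0  # number of host-to-device submit calls
        self.executed = []    # operations in the order they ran

    def submit(self, ops):
        # One host-side call, carrying one or more operations.
        self.submissions += 1
        self.executed.extend(ops)


# Build a small graph: copy in, two independent kernels, a join, copy out.
graph = ToyGraph()
graph.add_node("h2d_copy")
graph.add_node("kernel_a", depends_on=["h2d_copy"])
graph.add_node("kernel_b", depends_on=["h2d_copy"])
graph.add_node("kernel_c", depends_on=["kernel_a", "kernel_b"])
graph.add_node("d2h_copy", depends_on=["kernel_c"])

order = graph.instantiate()

# Eager style: one submission per operation.
eager = ToyDevice()
for op in order:
    eager.submit([op])

# Graph style: the whole instantiated DAG goes down in one submission.
graphed = ToyDevice()
graphed.submit(order)

print("eager submissions:", eager.submissions)
print("graph submissions:", graphed.submissions)
```

Both devices execute the same operations in the same dependency-respecting order. The only difference is how many times the host had to speak to the device, which is the cost a real graph launch is designed to amortize.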