Profiling GPU Workloads

A training step for a 7B-parameter model may spend 30% of its wall-clock time waiting on memory transfers that could be hidden with better tiling - but only if you can see the transfers at all. Without a profiler, every optimisation is archaeology: you make a change, time the whole run, and wonder whether the 2% speedup was real or noise. With a profiler you get a timeline that shows, to the microsecond, which kernel ran, how long it stalled on L2 misses, and whether the GPU was idle while the CPU was still queuing the next batch.

This concept explains how GPU profiling works, which tools to reach for at each layer of investigation, and which metrics actually predict whether a kernel is bottlenecked on compute or on memory.

The Two Bottleneck Categories

To a first approximation, every GPU kernel is either compute-bound or memory-bound (very small kernels can instead be dominated by launch overhead). Working out which one applies is the first job of any profiling session.

A kernel is compute-bound when arithmetic throughput (measured in TFLOP/s) is the limiting resource. Tensor-core GEMM on large matrices is the canonical case. A kernel is memory-bound when the rate at which data moves between DRAM and the SM register file is the constraint. Element-wise activations, layer normalisation, and many attention variants fall here.

The roofline model makes this crisp. Define arithmetic intensity as:

I = FLOPs / bytes_transferred

If I exceeds the machine's ridge point (peak FLOP/s divided by peak memory bandwidth), the kernel is compute-bound; below it, memory-bound. An A100 SXM4 has a ridge point of roughly 150 to 200 FLOP/byte for FP16 tensor-core operations: 312 TFLOP/s of dense FP16 throughput divided by about 1.6 TB/s (40 GB model) or 2.0 TB/s (80 GB model) of HBM bandwidth. A softmax over a [batch, seq_len] tensor rarely exceeds 10 FLOP/byte, so it is firmly memory-bound regardless of how you schedule threads.
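The roofline test can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the peak figures are assumptions taken from the published A100 40 GB SXM4 specifications, the function names are made up for this example, and the per-element FLOP and byte counts for softmax and GEMM are rough idealised estimates.

```python
# Roofline sketch. Hardware numbers are ASSUMPTIONS based on the published
# A100 40 GB SXM4 specifications; substitute the figures for your own device.
PEAK_FLOPS = 312e12        # dense FP16 tensor-core throughput, FLOP/s
PEAK_BANDWIDTH = 1.555e12  # HBM bandwidth, bytes/s


def arithmetic_intensity(flops: float, bytes_transferred: float) -> float:
    """I = FLOPs / bytes_transferred, in FLOP/byte."""
    return flops / bytes_transferred


def ridge_point(peak_flops: float = PEAK_FLOPS,
                peak_bandwidth: float = PEAK_BANDWIDTH) -> float:
    """Intensity at which the memory roof meets the compute roof."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     peak_bandwidth: float = PEAK_BANDWIDTH) -> float:
    """Upper bound on achievable FLOP/s at a given arithmetic intensity."""
    return min(peak_flops, intensity * peak_bandwidth)


def classify(intensity: float) -> str:
    """Label a kernel by which side of the ridge point it falls on."""
    return "compute-bound" if intensity > ridge_point() else "memory-bound"


# Hypothetical FP16 softmax over [batch, seq_len]: assume about 5 FLOPs per
# element, with each element read once and written once (2 bytes each way).
elements = 8 * 4096
softmax_i = arithmetic_intensity(5 * elements, 4 * elements)

# Idealised square FP16 GEMM: 2*n^3 FLOPs, three n*n matrices of 2-byte
# elements, each assumed to cross the memory bus exactly once.
n = 8192
gemm_i = arithmetic_intensity(2 * n**3, 3 * n**2 * 2)

for name, intensity in [("softmax", softmax_i), ("gemm", gemm_i)]:
    print(f"{name}: I = {intensity:.2f} FLOP/byte, {classify(intensity)}, "
          f"bound = {attainable_flops(intensity) / 1e12:.1f} TFLOP/s")
```

The `attainable_flops` bound is the useful part in practice: comparing a kernel's measured throughput against it shows how much headroom is left before the kernel hits whichever roof applies.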

Knowing which regime you are in tells you where optimisation effort belongs: a memory-bound kernel gets faster from better tiling, coalescing, and fusion; a compute-bound kernel needs better tensor-core utilisation and register occupancy.

The Tool Stack

Profiling GPU workloads involves three layers of tooling, each answering a different question.

System-level: Nsight Systems (nsys)

Nsight Systems captures a timeline of everything: CPU threads, CUDA API calls, kernel launches, PCIe transfers, NCCL collectives, and NVLink traffic. Its overhead is low because it relies on lightweight API tracing and OS-level sampling rather than replaying kernels to collect detailed hardware counters. Use it first to find the bottleneck region before drilling deeper.

nsys profile --trace=cuda,nvtx,osrt \
             --output=profile_run \
             python train.py

The resulting .nsys-rep file opens in the Nsight Systems GUI. The timeline view immediately shows whether kernels overlap with data transfers (they should), whether there are long CPU gaps between kernel launches (a sign of Python overhead or synchronisation), and whether NVLink is saturating during an allreduce.
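The "long gaps between kernel launches" check can also be done numerically once kernel start and end times are available as plain numbers. The sketch below assumes the intervals have already been extracted from the report into (start, end) pairs in microseconds; the extraction step is not shown, and the function names and sample timings are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_us, end_us) of one kernel execution


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping kernel intervals (e.g. from concurrent streams)."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous busy region: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def idle_gaps(intervals: List[Interval], min_gap_us: float = 0.0) -> List[Interval]:
    """Return the periods longer than min_gap_us in which no kernel was running."""
    merged = merge_intervals(intervals)
    gaps: List[Interval] = []
    for (_, prev_end), (next_start, _) in zip(merged, merged[1:]):
        if next_start - prev_end > min_gap_us:
            gaps.append((prev_end, next_start))
    return gaps


def busy_fraction(intervals: List[Interval]) -> float:
    """Fraction of the profiled span during which at least one kernel ran."""
    merged = merge_intervals(intervals)
    if not merged:
        return 0.0
    span = merged[-1][1] - merged[0][0]
    busy = sum(end - start for start, end in merged)
    return busy / span if span > 0 else 0.0


# Made-up timings for illustration only, not real profiler output.
kernels = [(0.0, 120.0), (125.0, 300.0), (100.0, 110.0), (900.0, 1000.0)]
print("idle gaps over 50 us:", idle_gaps(kernels, min_gap_us=50.0))
print("GPU busy fraction:", busy_fraction(kernels))
```

A low busy fraction with a few large gaps usually points at the host side (Python overhead, data loading, or an explicit synchronisation) rather than at any individual kernel.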

Kernel-level: Nsight Compute (ncu)

Once Nsight Systems has identified the hot kernels, Nsight Compute re-runs each one with full hardware counters to answer why it is slow. It reports achieved vs. theoretical occupancy, SM active cycles, memory throughput, cache hit rates, warp stall reasons, and whether tensor cores are actually being used.
