What a CUDA Kernel Is

A100 SXM4 delivers 312 teraFLOPS of FP16 compute. A naive matrix multiplication written in plain Python on the same machine achieves perhaps 0.1% of that. The gap is not a hardware secret or a library magic trick: it is almost entirely explained by understanding what a CUDA kernel is and what happens when you do not write one correctly.

The single idea: one function, thousands of instances

A CPU function executes once per call. A CUDA kernel executes N copies of itself in parallel, where N is a number you specify at launch. NVIDIA calls each copy a thread. The programmer's contract is: write the function as if it runs on one element, then tell the GPU how many threads to spawn.

A minimal kernel in CUDA C++:

__global__ void add_vectors(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Launch: 1024 blocks of 256 threads each = 262 144 threads total
add_vectors<<<1024, 256>>>(d_a, d_b, d_c, n);

The __global__ qualifier marks it as a kernel (called from host, runs on device). The triple-angle-bracket syntax <<<grid, block>>> is the launch configuration: it says "arrange threads into a grid of blocks; each block holds a fixed number of threads." Inside the kernel, threadIdx.x is a thread's local ID within its block, and blockIdx.x is the block's ID within the grid. Combining them gives each thread a globally unique integer used to index into data.

That index arithmetic is the entire programming model. Everything else (caches, warps, shared memory) is performance detail layered on top of this simple idea.

How threads actually execute: warps and SIMT

Threads do not run one-at-a-time. The GPU hardware groups 32 consecutive threads into a warp and executes them in lockstep on a single streaming multiprocessor (SM). This is called SIMT: Single Instruction, Multiple Threads.

Within a warp, all 32 threads execute the same instruction simultaneously but on different data - the same principle as CPU SIMD, except the programmer does not write explicit vector intrinsics. The hardware handles it.

The consequence: branching within a warp is expensive. If threads 0-15 take the if branch and threads 16-31 take the else branch, the hardware serialises both paths and uses a mask to suppress results. Effective throughput halves. For most ML kernels (matmul, softmax, layer norm), branching is rare and this rarely matters. In quantised inference kernels with per-group scale decisions, it can become a bottleneck.

The memory hierarchy: why coalescing matters

GPU memory bandwidth is the limiting resource for most workloads, not raw compute. An A100 has 2 TB/s of HBM2e bandwidth; a single uncoalesced memory access pattern can drop effective bandwidth to a fraction of that.

The rule: when 32 threads in a warp load data, they should load from a contiguous, aligned 128-byte region. The hardware then issues a single memory transaction for the whole warp. If threads access scattered addresses, the hardware issues up to 32 separate transactions, wasting 31/32 of the bus capacity.

Access pattern	Transactions per warp	Effective BW vs peak
Coalesced, aligned	1 (128 B)	~100%
Stride-2	2	~50%
Fully random	Up to 32	~3%

In practice, PyTorch's built-in operators (matmul, conv2d, attention) already do this correctly. You only face the problem when writing custom kernels or when naive indexing in a fused kernel accidentally introduces stride patterns.

Shared memory and tiling: fitting the roofline

An SM has a small, fast scratchpad called shared memory (48-96 KB per SM on recent hardware, configurable). Shared memory sits at L1 distance from the compute units - roughly 100x lower latency than HBM. Kernels that need to re-read the same data (the canonical case is matrix multiplication, where each output element requires an entire row and column) use a tiling strategy:

A block of threads cooperatively loads a tile of the input matrices into shared memory.
Every thread in the block reads from shared memory to compute partial dot products.
The block advances to the next tile, accumulating the result.

This transforms HBM bandwidth from O(N^3) to O(N^3/tile_size), which is why a well-written GEMM kernel can approach arithmetic intensity limits rather than bandwidth limits.

The roofline model formalises this: a kernel's achievable performance is bounded by min(peak_FLOPS, peak_BW * arithmetic_intensity). Tiling raises arithmetic intensity past the ridge point so compute, not memory, becomes the bottleneck.

From CUDA C++ to PyTorch: who writes kernels today

Most engineers do not write CUDA C++ kernels by hand. The typical stack:

PyTorch operators (torch.mm, F.softmax) call into highly optimised libraries (cuBLAS, cuDNN, CUTLASS) that contain hand-tuned kernels.
Triton (used internally by torch.compile) lets you write kernels in Python-like syntax; it handles tiling, shared memory management, and instruction selection automatically. The vector-add example from the official Triton docs closely mirrors the CUDA C++ example above, but without explicit memory transaction management.
torch.compile with Inductor traces PyTorch operations and emits fused Triton (or CUDA) kernels automatically, collapsing multiple operator calls into one kernel launch to eliminate the overhead of reading and writing HBM between steps.

Understanding the CUDA model is still necessary even if you never write __global__ yourself. It is the vocabulary for reading profiler output (ncu, nsys), diagnosing memory-bound vs compute-bound regressions, and evaluating whether a custom operator is worth the engineering cost.

When it falls down

Occupancy bottlenecks from register pressure. Each thread in a kernel uses a number of registers. An SM has a fixed register file (65 536 32-bit registers on recent architectures). If your kernel uses 128 registers per thread and blocks of 256 threads, only 2 blocks can live on the SM simultaneously: that is 512 active threads out of a theoretical maximum of 2048. Low occupancy means fewer warps to hide memory latency, and the SM stalls waiting for loads to return.

Shared memory bank conflicts. Shared memory is divided into 32 banks. If multiple threads in a warp access the same bank simultaneously (but different addresses), the accesses serialise. A naively written transpose kernel hits this by default.

Launch overhead at small batch sizes. Each kernel launch carries a few microseconds of CPU-side overhead. At inference with batch size 1, a model with hundreds of small operations can spend more time on launch overhead and HBM round-trips between kernels than on actual arithmetic. This is the primary motivation for CUDA Graphs (replay a pre-recorded sequence of launches) and kernel fusion.

Warp divergence in quantised kernels. Mixed-precision or grouped-quantisation kernels that choose different dequantisation paths per group can create warp-level divergence, negating some of the throughput gains from reducing data volume.

Host-device synchronisation. Any cudaDeviceSynchronize() call or .item() call in PyTorch stalls the CPU until the GPU finishes, serialising the pipeline. In training loops with per-step loss logging this is a common, silent performance killer.