Custom Kernels for Mixture-of-Experts

The padding trap

A standard transformer feed-forward layer receives a fixed-shape tensor and calls one big matrix multiplication. Mixture-of-Experts (MoE) breaks this assumption entirely. A router sends each token to one or two of E experts; the number of tokens landing on any given expert varies every forward pass. If expert 3 receives 47 tokens this step and expert 7 receives 312, a naive implementation pads every expert's batch to capacity = C tokens and then calls E separate GEMMs, one per expert.

That approach wastes compute in two distinct ways. First, padded zeros still burn FLOPs and memory bandwidth. Second, launching E small GEMMs is inefficient: a GEMM needs enough rows to fill its tiles and keep the GPU's multiprocessors occupied before it can approach peak throughput, and tiny GEMMs never get there. Switch Transformer (Fedus et al., 2021) reported that naive padding with a capacity factor of 1.25 drops about 3-10% of tokens on overflow, yet the kernel utilisation is still poor because the per-expert batches are too small to hide memory latency.
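To make the cost concrete, here is a minimal sketch of the bookkeeping behind fixed-capacity dispatch. It assumes top-1 routing and the usual definition of capacity as an even share of tokens per expert scaled by the capacity factor; the function name `padding_stats` and the four-expert load are hypothetical, chosen only to echo the imbalance described above.

```python
import math

def padding_stats(tokens_per_expert, capacity_factor):
    """Return (capacity, padded_slots, dropped_tokens) for fixed-capacity dispatch."""
    num_experts = len(tokens_per_expert)
    total_tokens = sum(tokens_per_expert)
    # Capacity: an even share of the tokens per expert, scaled by the capacity factor.
    capacity = math.ceil(capacity_factor * total_tokens / num_experts)
    # Slots filled with zeros because an expert received fewer tokens than capacity.
    padded = sum(max(capacity - n, 0) for n in tokens_per_expert)
    # Tokens that overflow an expert's capacity and are dropped.
    dropped = sum(max(n - capacity, 0) for n in tokens_per_expert)
    return capacity, padded, dropped

# Hypothetical load for 4 experts (512 tokens in total).
counts = [47, 312, 100, 53]
capacity, padded, dropped = padding_stats(counts, capacity_factor=1.25)
print(capacity, padded, dropped)  # 160 280 152
```

In this made-up case the layer allocates 640 slots for 512 tokens, fills 280 of them with zeros, and still drops 152 tokens from the overloaded expert. Padding and dropping are not alternatives here; an unbalanced router produces both at once.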

The right answer is a kernel purpose-built for variable-length expert batches.

Grouped GEMM: the first-principles solution

The grouped GEMM abstraction computes a set of independent matrix multiplications in a single kernel launch:

for i in range(E):
    C[i] = A[i] @ W[i]     # A[i] has shape (n_i, d_model), W[i] has shape (d_model, d_ff)

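Here n_i differs per expert. A grouped GEMM kernel tiles across all expert problems at once, assigning thread blocks to output tiles from whichever sub-problem still has work, so a lightly loaded expert does not leave the GPU idle. NVIDIA's cuBLAS exposes cublasGemmBatchedEx, which requires every problem in the batch to share the same dimensions, and cublasGemmGroupedBatchedEx, which allows dimensions to differ between groups. Both need metadata arrays describing each problem's data pointers, and the grouped variant also takes per-group size arrays that are specified on the host.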
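One way to picture the tiling is as a single flat work list covering every expert's output tiles. The sketch below is a host-side model only, not how any particular library implements it: the name `build_tile_schedule` and the 128x128 tile size are assumptions for illustration, and a real kernel would compute this mapping on the device.

```python
import math

def build_tile_schedule(rows_per_expert, d_ff, tile_m=128, tile_n=128):
    """Flatten the output tiles of every expert problem into one work list."""
    schedule = []
    tiles_n = math.ceil(d_ff / tile_n)
    for expert, n_rows in enumerate(rows_per_expert):
        # An expert with no tokens contributes zero tiles, so it costs nothing.
        tiles_m = math.ceil(n_rows / tile_m)
        for tm in range(tiles_m):
            for tn in range(tiles_n):
                schedule.append((expert, tm, tn))
    return schedule

# Hypothetical load: three experts, one of which received no tokens.
schedule = build_tile_schedule([47, 312, 0], d_ff=256)
print(len(schedule))  # 8
```

Each entry is one unit of work a thread block can claim. The expert with 312 tokens contributes six tiles, the expert with 47 contributes two, and the empty expert contributes none, so no launch is spent on an empty problem.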

The tricky part is that this metadata is only known after routing, and routing runs on the GPU. Naive implementations copy the per-expert token counts back to the CPU to build the arrays, which adds a device-to-host synchronisation between routing and the GEMM. That round-trip stalls the pipeline and prevents CUDA graph capture.

High-performance MoE stacks (Tutel, FasterMoE, MegaBlocks) solve this by fusing the sort/dispatch step with the grouped GEMM launch: the routing kernel writes its output metadata directly into GPU-resident descriptor buffers, and the GEMM kernel reads from those buffers without ever touching the CPU.
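The metadata in question is small: per-expert counts, the row offset where each expert's batch starts, and a permutation that groups tokens by expert. A minimal sketch of that computation follows, written as plain Python for readability. The name `build_dispatch_metadata` is hypothetical, top-1 routing is assumed, and in a real stack these steps would run as GPU kernels (a histogram, a prefix sum, and a sort) rather than Python loops.

```python
def build_dispatch_metadata(expert_ids, num_experts):
    """Return (counts, offsets, permutation) for tokens routed to experts."""
    # Histogram: how many tokens each expert received.
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    # Exclusive prefix sum: offsets[e] is the first row of expert e's batch.
    offsets = [0] * (num_experts + 1)
    for e in range(num_experts):
        offsets[e + 1] = offsets[e] + counts[e]
    # Stable counting sort: permutation[k] is the source token for sorted row k.
    cursor = offsets[:-1].copy()
    permutation = [0] * len(expert_ids)
    for token, e in enumerate(expert_ids):
        permutation[cursor[e]] = token
        cursor[e] += 1
    return counts, offsets, permutation

# Hypothetical routing decision for 5 tokens and 3 experts.
counts, offsets, perm = build_dispatch_metadata([2, 0, 2, 1, 0], num_experts=3)
print(counts, offsets, perm)  # [2, 1, 2] [0, 2, 3, 5] [1, 4, 3, 0, 2]
```

Expert i's input batch is then rows offsets[i] to offsets[i+1] of the permuted token matrix, which is all a grouped GEMM needs to locate its sub-problems.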

Block-sparse kernels: the MegaBlocks approach

MegaBlocks (Gale et al., 2022) reframes MoE computation as a block-sparse matrix multiplication. Instead of E separate dense sub-problems, the expert weight matrices are concatenated into one matrix W of shape (d_model, E * d_ff), and the product of the expert-sorted tokens with W is computed only on the blocks where an expert's rows meet that same expert's columns. The token-to-expert assignments therefore define the sparsity pattern, which is block-diagonal with blocks of varying height.

# Conceptual block-sparse view
W = concat(W_0, W_1, ..., W_{E-1}, axis=1)       # shape: (d_model, E*d_ff)
X_routed = scatter(X, routing_indices)           # permuted token matrix, rows grouped by expert
Y = sdd(X_routed, W, topology)                   # block-sparse output; only the block-diagonal blocks are computed

This formulation never pads expert batches to a fixed capacity and never drops tokens. The kernel traverses only non-zero blocks, so its compute and memory footprint scale with actual traffic, not maximum capacity. Gale et al. report end-to-end training speedups of up to 40% over MoEs trained with Tutel and up to 2.4x over dense models trained with Megatron-LM. The gain comes largely from avoiding empty work: tiles are launched only where tokens exist.
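The way token-to-expert assignments define a block-sparsity pattern can be sketched by listing the non-zero blocks directly. The example below is illustrative only and is not taken from the MegaBlocks code: it assumes tokens are already sorted by expert, that each expert's group is rounded up to a whole number of blocks, and a block size of 128. The name `block_topology` is hypothetical.

```python
import math

def block_topology(tokens_per_expert, d_ff, block=128):
    """Return the (row_block, col_block) coordinates of the non-zero blocks."""
    col_blocks_per_expert = math.ceil(d_ff / block)
    nonzero = []
    row_start = 0
    for expert, n in enumerate(tokens_per_expert):
        # Round each expert's token group up to a whole number of row blocks.
        row_blocks = math.ceil(n / block)
        # Expert i owns a contiguous range of column blocks.
        col_start = expert * col_blocks_per_expert
        for r in range(row_start, row_start + row_blocks):
            for c in range(col_start, col_start + col_blocks_per_expert):
                nonzero.append((r, c))
        row_start += row_blocks
    return nonzero

# Hypothetical load for 3 experts.
blocks = block_topology([47, 312, 100], d_ff=256)
print(len(blocks))  # 10
```

With these numbers the full grid would hold 5 x 6 = 30 blocks, but only 10 are non-zero. The overloaded expert simply gets more row blocks instead of overflowing a fixed capacity.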
