Custom Kernels for Mixture-of-Experts

MoE models break the dense-GEMM assumption that GPU libraries are optimised for, so efficient inference requires custom grouped-GEMM and block-sparse kernels that handle variable-length expert batches without padding or token dropping.

The padding trap

A standard transformer feed-forward layer receives a fixed-shape tensor and runs it through one large matrix multiplication. Mixture-of-Experts (MoE) breaks this assumption. A router sends each token to one or two of E experts, so the number of tokens landing on any given expert changes with every forward pass. If expert 3 receives 47 tokens this step and expert 7 receives 312, a naive implementation pads every expert's batch to a fixed capacity of C tokens, dropping any tokens beyond C, and then launches E separate GEMMs, one per expert.
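The naive scheme described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, assuming top-1 routing and a single weight matrix per expert; the function name `padded_moe_forward` and its parameters are hypothetical and not taken from any library.

```python
import numpy as np

def padded_moe_forward(x, expert_ids, weights, capacity):
    """Naive top-1 MoE layer: pad each expert's batch to `capacity`
    rows, then run one GEMM per expert.

    x:          (num_tokens, d_model) token activations
    expert_ids: (num_tokens,) expert index chosen by the router
    weights:    (num_experts, d_model, d_out) one matrix per expert
    """
    num_experts, d_model, d_out = weights.shape
    out = np.zeros((x.shape[0], d_out), dtype=x.dtype)
    dropped = 0
    for e in range(num_experts):
        token_idx = np.flatnonzero(expert_ids == e)
        kept = token_idx[:capacity]            # tokens beyond capacity are dropped
        dropped += len(token_idx) - len(kept)
        batch = np.zeros((capacity, d_model), dtype=x.dtype)  # zero padding
        batch[:len(kept)] = x[kept]
        y = batch @ weights[e]                 # full-size GEMM, padded rows included
        out[kept] = y[:len(kept)]              # dropped tokens keep a zero output
    return out, dropped

# Hypothetical routing: expert 0 is overloaded, experts 1 and 2 are underfilled.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4))
expert_ids = np.array([0] * 10 + [1] * 4 + [2] * 2)
weights = rng.standard_normal((3, 4, 4))
out, dropped = padded_moe_forward(x, expert_ids, weights, capacity=6)
print(dropped)  # 4 tokens overflow expert 0
```

Every expert multiplies a full `capacity`-row matrix regardless of how many real tokens it received, which is where both the padding waste and the token dropping come from.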

That approach wastes compute in two distinct ways. First, padded zeros still burn FLOPs and memory bandwidth. Second, launching E small GEMMs is inefficient: a GEMM needs enough rows to fill its tiles and occupy the GPU before it runs near peak throughput, and tiny GEMMs never get there. Switch Transformer (Fedus et al., 2021) reported that naive padding with a capacity factor of 1.25 drops about 3-10% of tokens on overflow, and kernel utilisation remains poor even so, because the per-expert batches are too small to hide memory latency.
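To make the two kinds of waste concrete, here is a small worked calculation. It assumes the common convention that capacity is the capacity factor times the average tokens per expert, rounded up; the helper name `padding_stats` and the token counts for the third and fourth experts are hypothetical, chosen only for illustration.

```python
import math

def padding_stats(tokens_per_expert, capacity_factor):
    """Fraction of tokens dropped and fraction of padded rows wasted
    when every expert batch is padded or truncated to one capacity."""
    num_experts = len(tokens_per_expert)
    total = sum(tokens_per_expert)
    # Assumed convention: capacity = factor * (average tokens per expert).
    capacity = math.ceil(capacity_factor * total / num_experts)
    kept = sum(min(n, capacity) for n in tokens_per_expert)
    padded_rows = num_experts * capacity
    return {
        "capacity": capacity,
        "dropped_fraction": (total - kept) / total,
        "wasted_fraction": (padded_rows - kept) / padded_rows,
    }

# 47 and 312 come from the example above; 100 and 141 are made up.
stats = padding_stats([47, 312, 100, 141], capacity_factor=1.25)
print(stats["capacity"])                    # 188
print(round(stats["dropped_fraction"], 3))  # 0.207
print(round(stats["wasted_fraction"], 3))   # 0.367
```

In this skewed example the layer drops about a fifth of its tokens and still spends over a third of its GEMM rows on zeros, so raising the capacity factor trades one kind of waste for the other.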
