Mixture-of-Experts Inference

MoE models look like a serving dream: DeepSeek-V3 activates 37B of its 671B parameters per token, so it should serve like a 37B dense model. In training, broadly, it does. In inference it absolutely does not. You still have to hold all 671B parameters in memory because any token might route to any expert, expert routing creates load imbalance, small batches catastrophically under-utilise the GPUs, and expert-parallel sharding adds an all-to-all communication step that dense models never pay. The throughput win is real, but the engineering bill is significant.

Why serving MoE is harder than dense

A dense model's inference cost is weights_read + kv_read per token. Predictable, contiguous, friendly to bandwidth optimisation. An MoE model adds three new problems:

You still hold all experts in HBM. Active parameters dictate FLOPs, but total parameters dictate memory. Mixtral 8x7B is 47B parameters, or 94 GB at fp16; it does not fit on a single 80 GB GPU at that precision, even though only ~13B parameters (~26 GB) activate per token.
Routing is data-dependent. The router decides per token which experts to fire. You discover the GEMM shapes at runtime, which fights every static-shape optimisation a compiler wants to do.
Load is rarely balanced. Even with auxiliary balance losses, popular experts get 2-3x the load of cold ones during inference. The slowest expert sets the step time.
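
The second and third problems can be seen in a toy simulation. The sketch below is not a real router: `route_batch` and the `popularity` weights are hypothetical stand-ins for a learned gating network, and the skew (two experts three times as popular as the rest) is an assumed value chosen only to illustrate the effect. It counts how many tokens land on each expert in one decode step.

```python
import random
from collections import Counter

def route_batch(num_tokens, num_experts, top_k, popularity, seed=0):
    """Return tokens-per-expert counts for one decode step.

    `popularity` holds hypothetical per-expert weights standing in for a
    learned router; a real router scores each token's hidden state.
    """
    rng = random.Random(seed)
    experts = list(range(num_experts))
    counts = Counter()
    for _ in range(num_tokens):
        chosen = set()
        # Draw until this token has top_k distinct experts, biased by popularity.
        while len(chosen) < top_k:
            chosen.add(rng.choices(experts, weights=popularity, k=1)[0])
        counts.update(chosen)
    return [counts[e] for e in experts]

# Assumed skew: experts 0 and 1 are three times as popular as the others.
popularity = [3.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
load = route_batch(num_tokens=64, num_experts=8, top_k=2, popularity=popularity)
mean_load = sum(load) / len(load)

# Each entry is the M dimension of that expert's GEMM, known only at runtime.
print("tokens per expert:", load)
# The busiest expert sets the step time, so this ratio is the slowdown
# relative to perfectly balanced routing.
print("max / mean load:", max(load) / mean_load)
```

Changing `seed` or `popularity` changes every expert's GEMM shape, which is the property that defeats static-shape compilation.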

Top-k routing and the memory bill

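Production MoE models use top-k routing: each token activates k of N experts. Mixtral uses top-2 of 8; DeepSeek-V3 routes each token to 8 of 256 routed experts plus one shared expert. Whatever k is, memory is set by the total parameter count. Memory budget for inference: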

Model	Total params	Active/token	HBM at FP16 (weights)	HBM at FP8 (weights)
Mixtral 8x7B	47B	~13B	94 GB	47 GB
Mixtral 8x22B	141B	~39B	282 GB	141 GB
DeepSeek-V3	671B	37B	~1.3 TB	~670 GB
Llama-4 Maverick	400B	17B	800 GB	400 GB

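You read only the active experts' weights per token, so at small batch sizes decode bandwidth is bounded by active parameters; that is the whole point. As the batch grows, its tokens collectively route to more experts, and the weights read per step climb towards the full model. Either way, you cannot evict cold experts cheaply because the next token may route to them. Holding the full model resident is non-negotiable for any non-trivial throughput.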
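
The table's memory figures are parameter count times bytes per parameter. A minimal sketch of that arithmetic follows, using the parameter counts from the table; the helper names `weight_gb` and `footprint` are illustrative. It covers weights only and ignores the KV cache, activations, and any quantisation scale overhead.

```python
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

# (total params, active params per token), in billions, from the table above.
MODELS = {
    "Mixtral 8x7B": (47, 13),
    "Mixtral 8x22B": (141, 39),
    "DeepSeek-V3": (671, 37),
    "Llama-4 Maverick": (400, 17),
}

def weight_gb(params_billions, dtype):
    """Weight size in GB (1 GB = 1e9 bytes): billions of params x bytes each."""
    return params_billions * BYTES_PER_PARAM[dtype]

def footprint(name, dtype):
    """Contrast what must stay resident with what a single token touches."""
    total, active = MODELS[name]
    return {
        "resident_gb": weight_gb(total, dtype),          # must stay in HBM
        "read_per_token_gb": weight_gb(active, dtype),   # touched by one token
    }

for name in MODELS:
    print(name, "fp16:", footprint(name, "fp16"), "fp8:", footprint(name, "fp8"))
```

The gap between the two numbers is the MoE trade-off: for DeepSeek-V3 at fp16, one token touches 74 GB of weights while 1342 GB must stay resident.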

The small-batch problem

This is the one that surprises people. Suppose batch size is 1 and you generate one token. The router picks 2 of 256 experts. Those 2 experts each do a single matmul on a single token's hidden state: a (1 x d) by (d x h) product, a GEMM whose M dimension is 1. Tensor cores want big M dimensions; 1 is the worst possible shape. Meanwhile the other 254 experts sit idle, taking up HBM and contributing nothing.

You loaded the full model. You got the FLOPs of a 1-token, 2-expert decode. Effective compute utilisation is in the low single-digit percent at best.
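
To put rough numbers on this, the sketch below estimates how many rows each expert's GEMM sees as the batch grows. It assumes perfectly uniform routing, with each token picking its k experts independently of the others. That is optimistic, since real routing is skewed. The function names are illustrative, and the target of 64 rows per expert is an arbitrary example value, not a hardware threshold.

```python
def tokens_per_expert(batch_tokens, num_experts, top_k):
    """Average GEMM M dimension per expert, averaged over all experts."""
    return batch_tokens * top_k / num_experts

def expected_active_experts(batch_tokens, num_experts, top_k):
    """Expected number of distinct experts hit in one step.

    Assumes each token picks top_k distinct experts uniformly at random,
    so a given expert is missed by one token with probability 1 - k/N.
    """
    p_missed_by_all = (1 - top_k / num_experts) ** batch_tokens
    return num_experts * (1 - p_missed_by_all)

def batch_for_target_m(target_m, num_experts, top_k):
    """Tokens per step needed to average target_m rows per expert."""
    return target_m * num_experts / top_k

# The 2-of-256 configuration from the example above.
for batch in (1, 64, 1024, 8192):
    print(
        batch,
        tokens_per_expert(batch, 256, 2),
        round(expected_active_experts(batch, 256, 2), 1),
    )

print("tokens per step for M = 64:", batch_for_target_m(64, 256, 2))
```

Under these assumptions a step needs 8192 tokens before the average expert sees 64 rows, which is why MoE serving leans so heavily on large batches.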
