Mixture of Experts

A Mixture of Experts (MoE) layer replaces a single feed-forward block with N parallel "expert" feed-forward blocks and a router that picks which experts to activate for each token. You get the parameter count of a much larger model with the per-token FLOPs of a much smaller one.

How routing works

For each token, the router computes a softmax over the experts and selects the k highest-scoring ones (often k=2, as in Mixtral). Only those experts run for that token; the rest sit idle.

expert_scores = softmax(W_router @ token_hidden)   # one score per expert
top_k_experts = top_k(expert_scores, k)            # indices of the k highest scores
output = sum(expert_scores[i] * expert_i(token_hidden) for i in top_k_experts)
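The pseudocode above can be turned into a small runnable sketch. This is a minimal single-token illustration in NumPy, not any particular model's implementation: the function name moe_forward, the toy linear experts, and the renormalize flag are all assumptions made for the example. With renormalize=False it follows the pseudocode exactly; some implementations instead rescale the k selected scores so they sum to 1.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_forward(token_hidden, w_router, experts, k=2, renormalize=False):
    """Route one token through its top-k experts and mix their outputs.

    token_hidden: (d,) hidden state for a single token
    w_router:     (n_experts, d) router weight matrix
    experts:      list of callables, each mapping (d,) -> (d,)
    """
    scores = softmax(w_router @ token_hidden)   # one probability per expert
    top_k = np.argsort(scores)[-k:][::-1]       # k highest scores, best first
    weights = scores[top_k]
    if renormalize:                             # optional: make the k weights sum to 1
        weights = weights / weights.sum()
    # Only the selected experts are evaluated; the others are never called.
    output = sum(w * experts[i](token_hidden) for w, i in zip(weights, top_k))
    return output, top_k, weights

# Toy setup (hypothetical sizes): 8 linear "experts" on a 4-dimensional hidden state.
rng = np.random.default_rng(0)
d, n_experts = 4, 8
w_router = rng.normal(size=(n_experts, d))
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda h, W=W: W @ h for W in expert_mats]

token = rng.normal(size=d)
out, chosen, weights = moe_forward(token, w_router, experts, k=2)
print("chosen experts:", chosen, "weights:", weights)
```

Real layers do this for a whole batch at once and group tokens by expert before running each expert, but the per-token logic is the same.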

Why it works in production

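Total parameters scale with the number of experts (Mixtral 8x7B has about 47B in total).
Active parameters per token stay low (Mixtral activates ~13B per token).
Per-token inference compute is dominated by active parameters, not total. Memory is not: every expert still has to be resident.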
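The total-versus-active arithmetic can be sketched in a few lines. The split below is an assumption for illustration: shared stands for everything every token passes through (attention, embeddings, router), per_expert for one expert's feed-forward weights summed over all layers. The numbers are round, hypothetical figures in millions, chosen only to land near the Mixtral totals quoted above; they are not official values.

```python
def moe_param_counts(shared, per_expert, n_experts, top_k):
    """Return (total, active) parameter counts for a simple MoE model.

    shared:     parameters used by every token (attention, embeddings, router)
    per_expert: parameters in one expert, summed over all layers
    """
    total = shared + n_experts * per_expert   # everything that must be stored
    active = shared + top_k * per_expert      # what one token actually touches
    return total, active

# Hypothetical round numbers, in millions of parameters.
total, active = moe_param_counts(shared=1_600, per_expert=5_600, n_experts=8, top_k=2)
print(f"total: {total / 1000:.1f}B, active per token: {active / 1000:.1f}B")
```

The shared term is also why an "8x7B" model does not come to 56B parameters: only the expert blocks are replicated eight times.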

This is why Mixtral, DeepSeek-V3, Grok-1, and the rumoured GPT-4 architecture all use MoE.

What makes it hard

Load balancing. Without a balancing mechanism, most commonly an auxiliary loss, the router tends to collapse onto a few popular experts; the rest receive almost no tokens and never train properly.
Communication cost. Distributed training has to ship hidden states to whichever GPU holds the chosen expert. Expert-parallel sharding is its own discipline.
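Throughput vs latency. Batched inference keeps many experts busy and amortises the cost of loading their weights, but adds latency. Single-stream serving is the opposite: low latency, but each active expert's weights are loaded to serve a single token.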
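To make the load-balancing point concrete, here is a minimal sketch of one common auxiliary loss, in the style used by the Switch Transformer with top-1 routing: N times the sum over experts of (fraction of tokens sent to the expert) times (mean router probability for the expert). The function name and the toy inputs are assumptions for illustration; in training this term is multiplied by a small coefficient and added to the main loss, and the gradient flows through the probabilities rather than the token counts.

```python
import numpy as np

def load_balance_loss(router_probs, expert_index):
    """Switch-style auxiliary loss for top-1 routing.

    router_probs: (n_tokens, n_experts) softmax outputs of the router
    expert_index: (n_tokens,) integer index of the expert each token was sent to
    Returns about 1.0 when load is perfectly even, up to n_experts when collapsed.
    """
    n_tokens, n_experts = router_probs.shape
    # f[i]: fraction of tokens dispatched to expert i
    f = np.bincount(expert_index, minlength=n_experts) / n_tokens
    # p[i]: mean router probability assigned to expert i
    p = router_probs.mean(axis=0)
    return n_experts * float(np.dot(f, p))

# Balanced toy case: 8 tokens spread evenly over 4 experts.
balanced_probs = np.full((8, 4), 0.25)
balanced_index = np.array([0, 1, 2, 3, 0, 1, 2, 3])

# Collapsed toy case: every token goes to expert 0 with probability 1.
collapsed_probs = np.zeros((8, 4))
collapsed_probs[:, 0] = 1.0
collapsed_index = np.zeros(8, dtype=int)

print(load_balance_loss(balanced_probs, balanced_index))    # even load
print(load_balance_loss(collapsed_probs, collapsed_index))  # collapsed router
```

The loss is smallest when tokens and probability mass are spread evenly, so minimising it pushes the router away from the collapsed case.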

Sparse vs dense at the same compute budget

For the same training FLOPs, MoE models typically reach better loss than dense models because parameters are cheap and FLOPs are expensive. The trade-off is engineering complexity. For a tiny model (under 7B active), dense usually wins.
