Hardware Cost of Mixture-of-Experts

Mixtral 8x7B holds 47 billion parameters yet activates only 13 billion per token - roughly the same arithmetic as a 13B dense model. On paper, that sounds like free capacity. In practice, the hardware bill is considerably more complicated.

Why MoE Looks Cheap on Paper

A standard Transformer FFN applies the same weights to every token. An MoE layer replaces one large FFN with E smaller expert FFNs and a lightweight router that picks k of them per token. The FLOP count per token scales with k, not with E, so adding experts adds capacity while per-token FFN compute stays essentially flat; only the router's small cost grows with E.
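
To make the routing step concrete, here is a minimal pure-Python sketch of top-k selection for a single token. The function names (`top_k_route`, `moe_ffn`) are hypothetical, and normalizing the gate weights with a softmax over only the k selected logits is one common convention assumed here, not something the text above specifies.

```python
import math

def top_k_route(logits, k):
    """Return [(expert_index, gate_weight), ...] for one token.

    logits: router scores, one per expert (length E).
    Assumption: gate weights are a softmax over the k selected logits only.
    """
    # Indices of the k largest logits, highest first (ties keep index order).
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Numerically stable softmax restricted to the chosen experts.
    m = max(logits[e] for e in chosen)
    exps = [math.exp(logits[e] - m) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]

def moe_ffn(token, logits, experts, k):
    """Gate-weighted sum of the k chosen experts' outputs.

    The remaining E - k experts are never evaluated, which is why
    per-token compute follows k rather than E.
    """
    out = [0.0] * len(token)
    for e, gate in top_k_route(logits, k):
        y = experts[e](token)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out
```

Each entry of `experts` stands in for a full expert FFN; any callable that maps a vector to a vector of the same length works for experimentation.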

Let d be the model dimension, d_ff the per-expert hidden size, E the number of experts, and k the top-k selection count. The FLOP ratio relative to a dense FFN of size E * d_ff is approximately:

FLOP ratio ≈ k / E

For Mixtral (k=2, E=8) this is 0.25 - a 4x reduction in FFN compute. Switch Transformer (k=1) pushes this to 1/E. The roofline model says: if your workload is compute-bound, this is a genuine win.
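
The k / E ratio can be checked with a few lines of arithmetic. This is a sketch under simplifying assumptions: a plain two-matrix FFN, 2 FLOPs per multiply-add, and no router cost. The helper names and the values d=4096 and d_ff=14336 are illustrative; the ratio does not depend on them.

```python
def ffn_flops(d, hidden):
    """FLOPs for one token through a two-matrix FFN (d -> hidden -> d).

    Assumes 2 FLOPs per multiply-add; activations and biases are ignored.
    """
    return 2 * d * hidden + 2 * hidden * d

def moe_flop_ratio(d, d_ff, num_experts, k):
    """MoE FFN FLOPs per token divided by a dense FFN of hidden size E * d_ff."""
    moe = k * ffn_flops(d, d_ff)               # only k experts run per token
    dense = ffn_flops(d, num_experts * d_ff)   # every weight is used per token
    return moe / dense

# Illustrative dimensions (assumed); k and E follow the examples in the text.
mixtral_like = moe_flop_ratio(d=4096, d_ff=14336, num_experts=8, k=2)
switch_like = moe_flop_ratio(d=4096, d_ff=14336, num_experts=8, k=1)
print(mixtral_like, switch_like)
```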

The problem is that modern large-scale inference and training are rarely in the compute-bound regime for the FFN block alone.

The All-to-All Communication Wall

When you distribute experts across devices - which you must at scale, because the combined expert weights outgrow a single device's memory, and which is straightforward because each expert is a separate weight tensor - routing tokens to their assigned experts requires sending activations across the interconnect. The canonical pattern is a pair of all-to-all collectives per MoE layer:

Dispatch: scatter each token's hidden state to the device hosting its chosen expert.
Combine: gather the expert outputs back to the originating device.
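
The two steps can be simulated in a single process to show the data movement they imply. This is a toy sketch, not a real collective: hidden states are scalars, "devices" are dictionary keys, and the names `dispatch`, `combine`, and `expert_to_device` are hypothetical.

```python
from collections import defaultdict

def dispatch(routes, expert_to_device):
    """Bucket routed tokens by the device that hosts their expert.

    routes: list of (token_id, expert_id, gate) triples, k per token.
    Returns {device: [(token_id, expert_id, gate), ...]}.
    """
    buckets = defaultdict(list)
    for token_id, expert_id, gate in routes:
        buckets[expert_to_device[expert_id]].append((token_id, expert_id, gate))
    return dict(buckets)

def combine(buckets, hidden, experts):
    """Run each expert on its device's bucket, then sum gated outputs per token."""
    out = defaultdict(float)
    for device, items in buckets.items():
        for token_id, expert_id, gate in items:
            out[token_id] += gate * experts[expert_id](hidden[token_id])
    return dict(out)

# Toy setup: 4 experts on 2 devices, 2 tokens, k=2.
expert_to_device = {0: 0, 1: 0, 2: 1, 3: 1}
experts = [lambda x, s=s: s * x for s in (1, 2, 3, 4)]  # expert i scales by i + 1
hidden = {0: 2.0, 1: 4.0}
routes = [(0, 0, 0.25), (0, 3, 0.75), (1, 1, 0.5), (1, 3, 0.5)]

buckets = dispatch(routes, expert_to_device)
outputs = combine(buckets, hidden, experts)
```

In a real deployment every entry in a bucket whose device differs from the token's home device is a hidden-state vector crossing the interconnect, once on the way out and once on the way back.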

Each all-to-all moves B * k * d * sizeof(dtype) bytes across the fabric, where B is the local batch size in tokens. On a 2048-core TPU v3 pod (as in GShard), the inter-chip bandwidth is the bottleneck, not the expert FLOPs. NVLink or ICI bandwidth does not grow with the number of experts - it is fixed by the interconnect topology and its bisection bandwidth.

A rough cost model for a single all-to-all on N devices with bisection bandwidth B_net:

t_comm ≈ (B * k * d * dtype_bytes) * (N - 1) / (N * B_net)

At large N, this asymptotes to B * k * d * dtype_bytes / B_net, independent of N. That floor is non-trivial: for a batch of 512 tokens, d=4096, k=2, bfloat16 (2 bytes per element), and B_net=600 GB/s (the NVLink 3.0 per-GPU figure, used here as a stand-in for bisection bandwidth), the payload is about 8.4 MB and each all-to-all takes roughly 14 microseconds. With a dispatch and a combine per MoE layer, a 32-layer model with MoE in half its layers accumulates ~448 microseconds of communication overhead per forward pass. Compared to a few hundred microseconds of total compute, that is not negligible.
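
The cost model is simple enough to evaluate directly. The sketch below plugs the values from the text into the formula and sweeps N to show the approach to the floor. It is bandwidth-only: per-message latency, contention, and load imbalance are ignored, and the function name is hypothetical.

```python
def all_to_all_seconds(tokens, k, d, dtype_bytes, num_devices, bisection_bw):
    """Estimated time for one all-to-all under the bandwidth-only model.

    bisection_bw is in bytes per second; latency and contention are ignored.
    """
    payload = tokens * k * d * dtype_bytes
    return payload * (num_devices - 1) / (num_devices * bisection_bw)

# Values from the text: 512 tokens, k=2, d=4096, bfloat16 (2 bytes), 600 GB/s.
for n in (2, 8, 64, 1024):
    t = all_to_all_seconds(512, 2, 4096, 2, n, 600e9)
    print(f"N={n:5d}  t_comm = {t * 1e6:5.2f} us")

# Large-N limit of the formula, in seconds.
floor = 512 * 2 * 4096 * 2 / 600e9
```

The (N - 1) / N factor is already 0.875 at N=8, so the estimate sits close to the floor for any realistic expert-parallel group size.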
