Hardware Cost of Mixture-of-Experts
Sparse MoE models reduce FLOPs per token but introduce all-to-all communication, load-imbalance penalties, and memory pressure that can erase those savings unless the system is carefully co-designed.
Mixtral 8x7B holds about 47 billion parameters yet activates only about 13 billion per token, roughly the same arithmetic as a 13B dense model. On paper, that sounds like free capacity. In practice, the hardware bill is considerably more complicated.
Why MoE Looks Cheap on Paper
A standard Transformer FFN applies the same weights to every token. An MoE layer replaces one large FFN with E smaller expert FFNs and a lightweight router that picks k of them per token. The FLOP count per token scales with k, not with E (apart from the small router projection, which scores all E experts), so adding experts adds capacity without proportionally more computation.
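The scaling argument above can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope count for a single MoE layer, not a measurement of any real model: the function names are hypothetical, the FFN is assumed to be gated (three weight matrices per expert), each weight is assumed to cost one multiply-add (2 FLOPs) per token, and the dimensions in the loop are illustrative values for a 7B-class model.

```python
def ffn_params(d_model: int, d_ff: int, n_matrices: int = 3) -> int:
    """Weights in one FFN. n_matrices=3 assumes a gated FFN (assumption)."""
    return n_matrices * d_model * d_ff


def moe_layer_cost(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (stored_params, active_params, flops_per_token) for one MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts           # router scores every expert
    stored = num_experts * per_expert + router
    active = top_k * per_expert + router     # only top_k experts run per token
    flops = 2 * active                       # one multiply-add per active weight
    return stored, active, flops


if __name__ == "__main__":
    # Illustrative dimensions (assumed, not taken from the article).
    d_model, d_ff, top_k = 4096, 14336, 2
    for num_experts in (2, 8, 64):
        stored, active, flops = moe_layer_cost(d_model, d_ff, num_experts, top_k)
        print(f"E={num_experts:3d}  stored={stored / 1e9:6.2f}B  "
              f"active={active / 1e9:5.2f}B  GFLOPs/token={flops / 1e9:5.2f}")
```

With k held fixed, the stored weight count grows roughly linearly in E, while the active weights and FLOPs per token change only through the router term `d_model * num_experts`. That gap between stored and active weights is where the "cheap on paper" impression comes from. The count covers compute only and says nothing about memory or communication.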