Mixture of Experts: How Frontier LLMs Achieve Scale Without Proportional Cost
June 02, 2026 · 27 min read
A 671-billion-parameter language model trained for $5.6 million. That is what DeepSeek-V3 achieved in late 2024, matching GPT-4o-class performance while spending roughly one-fifth the compute of comparable dense models (DeepSeek-AI, 2024). The trick was not a better optimizer or a novel attention mechanism. It was an architectural bet: build the model with 256 specialist sub-networks per layer, but activate only 8 of them for any given token. Every expert gets trained, yet any single token, in training or at inference, touches only a small fraction of the weights.
This is the Mixture of Experts (MoE) paradigm, and it is now the default architecture for frontier-scale language models. GPT-4 (reportedly), Mixtral, Grok-1, DBRX, DeepSeek-V2/V3, Qwen-MoE: the list of models built on sparse expert routing keeps growing. The economics are too compelling to ignore. If you can get 90% of the quality by activating 5% of the parameters, the question is not whether to use MoE. It is how to make the routing stable, the experts specialised, and the inference memory manageable.
Why this matters: MoE breaks the linear relationship between parameter count and compute cost. Understanding gating functions, load balancing, and expert collapse is now prerequisite knowledge for anyone building, fine-tuning, or deploying large language models. The architecture choices made in MoE design directly determine training efficiency, inference latency, and serving cost.
TL;DR
- MoE models contain many "expert" sub-networks but activate only a small subset (typically 1-8) per token, achieving dense-model quality at a fraction of the compute.
- The gating (router) network learns which experts to activate via a softmax over a learned projection, selecting the top-k scoring experts.
- Load balancing losses prevent "expert collapse," where the router learns to send all tokens to a few favourite experts, leaving others permanently untrained.
- DeepSeek-V3's fine-grained approach (256 experts, 8 active) with shared experts and auxiliary-loss-free balancing represents the current state of the art.
- MoE models need far more memory than their active parameter count suggests (about 3.6x for Mixtral 8x7B, about 18x for DeepSeek-V3), because all expert weights must be resident even though most are idle per token.
- Scaling-law studies report that a compute-optimal MoE matches a dense model trained with roughly 20x the compute at \(10^{20}\) FLOPs, with the advantage growing at larger compute budgets.
- Expert choice routing (experts pick tokens instead of tokens picking experts) eliminates load imbalance by construction but changes the computation graph.
- The dominant practical challenge is inference: all-to-all communication in expert parallelism can cost 9x the bandwidth of tensor parallelism.
At a Glance
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart LR
subgraph Input["Token Processing"]
X["Input token x"]
R["Router network"]
end
subgraph Experts["Expert Pool (N total)"]
E1["Expert 1"]
E2["Expert 2"]
E3["Expert k"]
EN["Expert N"]
end
subgraph Output["Aggregation"]
W["Weighted sum"]
Y["Output y"]
end
X --> R
R -->|"gate score g1"| E1
R -->|"gate score g2"| E2
R -->|"gate score gk"| E3
R -.->|"score = 0"| EN
E1 --> W
E2 --> W
E3 --> W
W --> Y
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
class X,R blue
class E1,E2,E3 purple
class EN slate
class W,Y emerald
Before Sparse Experts
The concept of splitting a model into specialist sub-networks is older than transformers, older than deep learning, older than GPU compute. The lineage traces back to 1991, when Jacobs, Jordan, Nowlan, and Hinton introduced "Adaptive Mixtures of Local Experts" (Jacobs et al., 1991). Their system divided a vowel discrimination task into subtasks, each handled by a small specialist network, with a gating network choosing which specialist to trust. The core insight has not changed in 35 years. What has changed is the scale at which it operates and the engineering required to make it work.
%%{init: {'theme': 'base', 'themeVariables': {'cScale0': '#1e40af', 'cScale1': '#6d28d9', 'cScale2': '#b45309', 'cScale3': '#be123c', 'cScale4': '#047857', 'cScale5': '#0e7490', 'cScale6': '#1e40af', 'cScaleLabel0': '#e2e8f0', 'cScaleLabel1': '#e2e8f0', 'cScaleLabel2': '#e2e8f0', 'cScaleLabel3': '#e2e8f0', 'cScaleLabel4': '#e2e8f0', 'cScaleLabel5': '#e2e8f0', 'cScaleLabel6': '#e2e8f0', 'textColor': '#e2e8f0', 'lineColor': '#94a3b8', 'fontSize': '16px'}}}%%
timeline
title The MoE Architecture Timeline
1991 : Jacobs et al. publish Adaptive Mixtures of Local Experts
: Gating network plus specialist sub-networks for classification
2017 : Shazeer et al. scale MoE to 137B parameters
: Sparsely-Gated MoE layer with top-2 routing and noise
2020 : GShard scales to 600B parameters across 2048 TPUs
: Top-2 routing in encoder-decoder for multilingual translation
2021 : Switch Transformer hits 1.6T parameters with top-1 routing
: 7x training speedup over T5-Base at matched quality
2022 : ST-MoE introduces router z-loss for training stability
: Expert choice routing lets experts pick tokens
2023 : Mixtral 8x7B ships 47B total, 13B active
: Matches LLaMA 2 70B at 6x faster inference
2024 : DeepSeek-V3 ships 671B total, 37B active for $5.6M
: DBRX ships 132B with fine-grained 16-choose-4 routing
For roughly 25 years after 1991, MoE remained a niche technique. The economics did not make sense when models were small enough to fit on a single machine. The inflection came in 2017, when Noam Shazeer and collaborators at Google Brain published "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (Shazeer et al., 2017). They demonstrated an LSTM language model with up to 137 billion parameters that achieved better perplexity than dense baselines while using comparable compute per token. The gating mechanism selected the top-k experts from thousands of candidates, with added noise to encourage exploration.
[IMAGE: Side-by-side comparison showing a dense transformer layer (left) versus an MoE transformer layer (right) where the FFN block is replaced by multiple expert FFN blocks with a router. Caption: "In a standard transformer, every token passes through the same feed-forward network. In an MoE transformer, a learned router selects which expert FFN blocks process each token."]
Google scaled the idea further in 2020 with GShard (Lepikhin et al., 2020), pushing MoE to 600 billion parameters across 2,048 TPU v3 cores for multilingual machine translation. Then came the Switch Transformer (Fedus et al., 2021), which simplified routing to top-1 (a single expert per token), reported pre-training speedups of up to 7x over T5-Base with the same computational resources, and scaled the design to 1.6 trillion parameters. The race to scale MoE was on.
How Mixture of Experts Actually Works
The Gating Function
An MoE layer replaces the standard feed-forward network (FFN) in a transformer block with \(N\) parallel expert networks \(E_1, E_2, \ldots, E_N\) and a router (gating) network \(G\). Each expert is typically a standard FFN with the same architecture but independent weights. The router is a simple linear projection followed by softmax and top-k selection.
Given an input token representation \(x \in \mathbb{R}^d\), the router computes:
\[h(x) = x \cdot W_g\]
where \(W_g \in \mathbb{R}^{d \times N}\) is a learned weight matrix. This produces a score for each of the \(N\) experts. The gating function then selects the top-\(k\) experts and zeros out the rest:
\[G(x) = \text{Softmax}(\text{TopK}(h(x), k))\]
where \(\text{TopK}(v, k)\) keeps the \(k\) largest entries of \(v\) and sets the rest to \(-\infty\) before the softmax normalization. The final output of the MoE layer is the weighted sum of the selected experts' outputs:
\[y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x)\]
In practice, only \(k\) terms in this sum are nonzero, so only \(k\) expert forward passes are computed. This is the fundamental source of MoE efficiency: the parameter count scales with \(N\) (total experts), but the compute scales with \(k\) (active experts).
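The project, select, normalize, and combine steps can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not any framework's implementation: the names `top_k_gate` and `moe_forward`, the list-of-lists layout for \(W_g\), and the toy scaling "experts" are assumptions made for the example.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def moe_forward(x, w_gate, experts, k):
    """y = sum_i G(x)_i * E_i(x), evaluating only the k selected experts."""
    n = len(experts)
    # h(x) = x . W_g, with W_g stored as d rows of N columns
    logits = [sum(x[j] * w_gate[j][i] for j in range(len(x))) for i in range(n)]
    gates = top_k_gate(logits, k)
    y = [0.0] * len(x)
    for i, g in enumerate(gates):
        if g > 0.0:  # unselected experts are never called
            out = experts[i](x)
            y = [acc + g * o for acc, o in zip(y, out)]
    return y, gates

# Toy setup: 4 "experts" that scale the input by 0, 1, 2, 3 (stand-ins for FFNs)
experts = [lambda v, s=s: [s * t for t in v] for s in range(4)]
w_gate = [[1.0, 1.0, 2.0, 0.0],
          [0.0, 1.0, 0.0, -1.0]]
y, gates = moe_forward([1.0, 2.0], w_gate, experts, k=2)
```

Here the router logits are `[1, 3, 2, -2]`, so only experts 1 and 2 run; experts 0 and 3 receive a gate weight of exactly zero and contribute no compute.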
[IMAGE: Visualization of the gating function computation. Shows an input vector x being multiplied by the router weight matrix W_g to produce N scores, then top-k selection zeroing out all but k scores, then softmax normalization producing gate weights. Caption: "The gating function in three steps: project, select, normalize. Only k experts receive nonzero weight."]
Noisy Top-K Gating
Shazeer et al. (2017) observed that the router, left to its own devices, will converge to always selecting the same small set of experts. This positive feedback loop (popular experts train more, becoming even better, attracting even more tokens) leads to expert collapse. Their solution was to add tunable Gaussian noise before the top-k selection:
\[H(x) = h(x) + \text{StandardNormal}() \cdot \text{Softplus}(x \cdot W_{\text{noise}})\]
The noise encourages the router to occasionally route tokens to less-popular experts, giving them training signal. The noise scale is itself learned, allowing the model to reduce exploration as training progresses and expert specialisations stabilise.
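A small sketch of the noisy logit computation, using only the standard library. The helper names (`softplus`, `matvec`, `noisy_logits`) and the toy weights are illustrative assumptions; the point is only to show how a learned, input-dependent scale turns the noise up or down per expert.

```python
import math
import random

def softplus(z):
    """Numerically stable log(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def matvec(x, w):
    """x . W for W stored as len(x) rows of N columns."""
    return [sum(x[j] * w[j][i] for j in range(len(x))) for i in range(len(w[0]))]

def noisy_logits(x, w_gate, w_noise, rng):
    """H(x) = h(x) + StandardNormal() * Softplus(x . W_noise), one draw per expert."""
    clean = matvec(x, w_gate)
    scale = [softplus(s) for s in matvec(x, w_noise)]
    return [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean, scale)]

rng = random.Random(0)  # seeded so the sketch is repeatable
x = [1.0, 2.0]
w_gate = [[1.0, 1.0, 2.0], [0.0, 1.0, 0.0]]   # clean logits: [1, 3, 2]
quiet = [[-20.0] * 3, [-20.0] * 3]  # pre-activation -60: noise scale is effectively zero
loud = [[0.5] * 3, [0.5] * 3]       # pre-activation 1.5: noise scale about 1.7
h_quiet = noisy_logits(x, w_gate, quiet, rng)
h_loud = noisy_logits(x, w_gate, loud, rng)
```

With the `quiet` noise weights the output is indistinguishable from the clean logits; with the `loud` weights the ranking of experts can change from draw to draw, which is what gives less-popular experts a chance to be selected.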
The Load Balancing Loss
Noise helps, but it is not sufficient at scale. Modern MoE systems add an explicit auxiliary loss that penalises imbalanced routing. The Switch Transformer formulation defines two vectors for a batch of \(T\) tokens and \(N\) experts:
\[f_i = \frac{1}{T} \sum_{x \in \text{Batch}} \mathbf{1}[\text{argmax}\, G(x) = i]\]
This is the fraction of tokens actually routed to expert \(i\). And:
\[P_i = \frac{1}{T} \sum_{x \in \text{Batch}} \text{Softmax}(h(x))_i\]
This is the average router probability assigned to expert \(i\) across the batch. The load balancing loss is:
\[\mathcal{L}_{\text{balance}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i\]
where \(\alpha\) is a hyperparameter (typically 0.01). The sum \(\sum_i f_i \cdot P_i\) is minimised under uniform routing, where \(f_i = P_i = 1/N\) for all \(i\) and the loss evaluates to \(\alpha\). The factor \(N\) normalises the loss so it remains comparable across different expert counts.
Why not just minimise the variance of \(f\)? Because \(f\) involves an argmax, which is not differentiable. The \(P\) term provides a smooth gradient signal that the router can learn from, while \(f\) provides the discrete routing decision for the actual forward pass.
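To make the behaviour of the loss concrete, here is a minimal sketch that evaluates \(\alpha \cdot N \cdot \sum_i f_i P_i\) for a balanced batch and a collapsed one. It computes values only; in a real framework the gradient would flow through \(P\) and not through \(f\). The function names and toy logits are assumptions for illustration.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(batch_logits, alpha=0.01):
    """alpha * N * sum_i f_i * P_i for top-1 routing over a batch of router logits."""
    t, n = len(batch_logits), len(batch_logits[0])
    f = [0.0] * n  # fraction of tokens whose argmax is expert i (no gradient)
    p = [0.0] * n  # mean router probability for expert i (the differentiable term)
    for logits in batch_logits:
        probs = softmax(logits)
        f[max(range(n), key=lambda i: probs[i])] += 1.0 / t
        for i in range(n):
            p[i] += probs[i] / t
    return alpha * n * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, four experts: each token prefers a different expert
balanced = [[10.0 if i == tok else 0.0 for i in range(4)] for tok in range(4)]
# Four tokens that all prefer expert 0
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]
loss_balanced = load_balance_loss(balanced)    # close to alpha
loss_collapsed = load_balance_loss(collapsed)  # close to alpha * N
```

The balanced batch sits at the floor of \(\alpha\), while the collapsed batch is penalised roughly \(N\) times as much, which is the pressure that pushes the router back toward even utilisation.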
[IMAGE: Bar chart showing token distribution across 8 experts under three scenarios: (a) no load balancing (one expert receives 60% of tokens), (b) load balancing with alpha=0.01 (roughly uniform), (c) excessive alpha=1.0 (perfectly uniform but degraded model quality). Caption: "The load balancing coefficient alpha controls the tradeoff between routing freedom and expert utilisation uniformity."]
Expert Capacity Factor
Even with load balancing, individual batches will have uneven routing. If an expert's buffer is full, tokens routed to it are dropped (their representation passes through unchanged, skipping the expert entirely). The capacity factor \(C\) determines how large each expert's buffer is relative to the perfectly balanced case:
\[\text{Expert buffer size} = C \cdot \frac{T}{N}\]
where \(T\) is the batch token count and \(N\) the number of experts. The Switch Transformer experiments compare capacity factors between 1.0 and 2.0 and find that low values (1.0 to 1.25) work well. Larger values waste memory and compute on padding; smaller values drop more tokens. The capacity factor is the padding between theory and practice in MoE routing.
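The effect of the capacity factor on token dropping can be sketched as follows. The rounding rule (round up) and the first-come, first-served fill order are assumptions made for this example; real implementations differ in both.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Buffer size C * T / N, rounded up (the rounding rule is an assumption)."""
    return math.ceil(capacity_factor * num_tokens / num_experts)

def dispatch(assignments, num_experts, capacity_factor):
    """Fill expert buffers in token order; tokens that hit a full buffer are dropped."""
    cap = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < cap:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)  # continues on the residual path only
    return cap, kept, dropped

# 8 tokens, 4 experts, with expert 0 oversubscribed (5 of 8 tokens)
assignments = [0, 0, 0, 0, 1, 2, 3, 0]
cap_tight, _, dropped_tight = dispatch(assignments, 4, capacity_factor=1.0)
cap_loose, _, dropped_loose = dispatch(assignments, 4, capacity_factor=1.25)
```

With \(C = 1.0\) each expert holds 2 tokens and three tokens are dropped; with \(C = 1.25\) the buffer grows to 3 and only two are dropped, at the cost of padding on the three underused experts.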
Router Z-Loss
At large scale, router logits can grow large enough to cause numerical instability in the softmax computation. Zoph et al. (2022) introduced the router z-loss in their ST-MoE work:
\[\mathcal{L}_{z} = \frac{1}{T} \sum_{x \in \text{Batch}} \left( \log \sum_{i=1}^{N} e^{h(x)_i} \right)^2\]
This penalises large logit magnitudes, keeping the softmax numerically stable without constraining which experts get selected. The z-loss has since been widely adopted in large-scale MoE training.
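A short sketch of the z-loss makes the "penalise magnitude, not selection" property visible. The two toy batches below have identical softmax outputs (softmax is shift-invariant), yet very different z-loss values. Function names are illustrative assumptions.

```python
import math

def logsumexp(logits):
    """Stable log(sum(exp(logits)))."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(v - peak) for v in logits))

def router_z_loss(batch_logits):
    """Mean over tokens of the squared log-sum-exp of the router logits."""
    return sum(logsumexp(l) ** 2 for l in batch_logits) / len(batch_logits)

small = [[1.0, 0.0, -1.0]]
shifted = [[101.0, 100.0, 99.0]]  # same ranking and same softmax, much larger magnitude
z_small = router_z_loss(small)
z_shifted = router_z_loss(shifted)
```

Because both batches route identically, the extra penalty on `shifted` comes purely from logit magnitude, which is the quantity that threatens numerical stability.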
Seeing It in Motion
Token Flow Through an MoE Transformer Layer
%%{init: {'theme': 'base', 'themeVariables': {'actorBkg': '#1e40af', 'actorTextColor': '#fff', 'actorBorder': '#3b82f6', 'signalColor': '#94a3b8', 'signalTextColor': '#e2e8f0', 'labelBoxBkgColor': '#1e293b', 'labelBoxBorderColor': '#334155', 'labelTextColor': '#e2e8f0', 'loopTextColor': '#e2e8f0', 'noteBkgColor': '#1e293b', 'noteTextColor': '#e2e8f0', 'noteBorderColor': '#475569', 'activationBorderColor': '#3b82f6', 'activationBkgColor': '#1e3a5f', 'fontSize': '16px'}}}%%
sequenceDiagram
participant T as Token x
participant A as Self-Attention
participant N as LayerNorm
participant R as Router
participant E2 as Expert 2
participant E5 as Expert 5
participant S as Weighted Sum
T->>A: Input representation
A->>N: Attention output + residual
N->>R: Normalized hidden state
R->>R: Compute h(x) = x · Wg
Note over R: Top-2 selection, k=2
R->>E2: Gate weight g2 = 0.72
R->>E5: Gate weight g5 = 0.28
E2->>S: E2(x) scaled by 0.72
E5->>S: E5(x) scaled by 0.28
S->>T: y = 0.72·E2(x) + 0.28·E5(x) + residual
Dense vs. Coarse-Grained vs. Fine-Grained MoE
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart TB
subgraph Dense["Dense Model"]
direction TB
D_IN["Token x"] --> D_FFN["Single FFN<br/>All parameters active"]
D_FFN --> D_OUT["Output y"]
end
subgraph Coarse["Coarse MoE (Mixtral-style)"]
direction TB
C_IN["Token x"] --> C_R["Router"]
C_R -->|"top-2"| C_E1["Expert 1<br/>Full-size FFN"]
C_R -->|"top-2"| C_E2["Expert 3<br/>Full-size FFN"]
C_R -.-> C_E3["Experts 2,4-8<br/>Inactive"]
C_E1 --> C_OUT["Weighted output"]
C_E2 --> C_OUT
end
subgraph Fine["Fine-Grained MoE (DeepSeek-style)"]
direction TB
F_IN["Token x"] --> F_S["Shared Expert<br/>Always active"]
F_IN --> F_R["Router"]
F_R -->|"top-8 of 256"| F_E1["Expert 12"]
F_R -->|"top-8 of 256"| F_E2["Expert 47"]
F_R -->|"top-8 of 256"| F_E3["Expert 198"]
F_R -.-> F_REST["248 experts<br/>Inactive"]
F_S --> F_OUT["Combined output"]
F_E1 --> F_OUT
F_E2 --> F_OUT
F_E3 --> F_OUT
end
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
class D_IN,C_IN,F_IN blue
class D_FFN rose
class C_R,F_R amber
class C_E1,C_E2 purple
class F_S emerald
class F_E1,F_E2,F_E3 teal
class C_E3,F_REST slate
class D_OUT,C_OUT,F_OUT emerald
[IMAGE: Heatmap showing expert activation patterns across a sequence of tokens in Mixtral 8x7B. Rows represent experts (1-8), columns represent token positions, and color intensity shows gate weight. Demonstrates that different token types (punctuation, nouns, verbs, code tokens) tend to activate different expert combinations. Caption: "Expert activation heatmap for Mixtral 8x7B across a 64-token sequence. Note the clustering: syntactic tokens (punctuation, articles) tend to favour experts 1 and 4, while domain-specific tokens (code, math) route to experts 6 and 7."]
By the Numbers
The following table compares major MoE models against dense baselines, using publicly reported figures from official technical reports and papers.
| Model | Year | Total Params | Active Params | Experts | Active/Token | MMLU | Training Efficiency |
|---|---|---|---|---|---|---|---|
| LLaMA 2 70B (dense) | 2023 | 70B | 70B | 1 | 1/1 | 68.9 | Baseline |
| Mixtral 8x7B | 2023 | 47B | 13B | 8 | 2/8 | 70.6 | Matches 70B at 6x faster inference |
| Mixtral 8x22B | 2024 | 141B | 39B | 8 | 2/8 | 77.8 | Outperforms at 39B active cost |
| DBRX | 2024 | 132B | 36B | 16 | 4/16 | 73.7 | 2x faster than LLaMA 2 70B inference |
| Grok-1 | 2024 | 314B | ~79B | 8 | 2/8 | N/A | 25% of weights active per token |
| DeepSeek-V2 | 2024 | 236B | 21B | 160 routed + 2 shared | 6 routed + 2 shared | 78.5 | 42.5% training cost savings vs dense |
| DeepSeek-V3 | 2024 | 671B | 37B | 256 routed + 1 shared | 8 routed + 1 shared | 88.5 | $5.6M total, ~1/5 comparable dense cost |
| Switch-C | 2021 | 1,600B | ~13B | 2,048 | 1/2,048 | N/A | 4x speedup over T5-XXL |
| GPT-4 (rumoured) | 2023 | ~1,800B | ~220B | 8 | 2/8 | 86.4 | Unconfirmed MoE architecture |
| GShard | 2020 | 600B | ~15B | 2,048 | 2/2,048 | N/A | Trained on 2,048 TPUs in 4 days |
Sources: Mixtral (Jiang et al., 2024), DeepSeek-V2 (DeepSeek-AI, 2024), DeepSeek-V3 (DeepSeek-AI, 2024), DBRX (Databricks, 2024), Grok-1 (xAI, 2024), Switch Transformer (Fedus et al., 2021), GShard (Lepikhin et al., 2020). GPT-4 figures are from widely circulated leaks and have not been confirmed by OpenAI.
The pattern is consistent: MoE models achieve quality comparable to or better than dense models with 2-5x fewer active parameters per token. The cost savings compound because training a model with fewer active parameters requires proportionally fewer FLOPs per training step, even though the total parameter count (and thus memory) is larger.
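The gap between what a model stores and what it computes with can be read straight off the table. The snippet below uses only the total and active parameter counts listed there (in billions) and computes the ratio between them; the dictionary layout is just a convenience for the example.

```python
# (total, active) parameter counts in billions, copied from the table above
models = {
    "Mixtral 8x7B": (47, 13),
    "Mixtral 8x22B": (141, 39),
    "DBRX": (132, 36),
    "DeepSeek-V2": (236, 21),
    "DeepSeek-V3": (671, 37),
}

# Ratio of stored parameters to parameters used per token
ratios = {name: total / active for name, (total, active) in models.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:14s} stores {ratio:4.1f}x its active parameter count")
```

The coarse-grained models cluster near 3.6x, while the fine-grained DeepSeek models store roughly 11x and 18x their active parameter count, which is the memory side of the trade discussed later in the article.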
[IMAGE: Scatter plot with total parameters on x-axis and MMLU score on y-axis. Dense models shown as circles, MoE models as stars. A regression line through dense models shows the expected scaling; MoE models cluster well above this line, achieving higher quality per total parameter. Caption: "MoE models consistently outperform the dense scaling curve, delivering more quality per parameter-dollar."]
The Scaling Law Advantage
Recent work on MoE scaling laws quantifies the advantage precisely. A compute-optimal MoE model trained with a budget of \(10^{20}\) FLOPs achieves the same quality as a dense transformer trained with a 20x greater compute budget. This advantage grows with scale: at \(10^{25}\) FLOPs, the savings exceed 40x (Scaling Laws Across Model Architectures, 2024). MoE models also demonstrate approximately 16% better data utilisation under similar compute budgets, achieving comparable performance with fewer training tokens.
A Concrete Example
Consider a single token, the word "gradient," passing through one MoE layer in a Mixtral 8x7B-style model. The model has 8 experts with top-2 routing.
Step 1: Router projection. The token's hidden representation \(x \in \mathbb{R}^{4096}\) is multiplied by the router weight matrix \(W_g \in \mathbb{R}^{4096 \times 8}\), producing 8 raw scores:
\[h(x) = [1.23, -0.41, 0.87, -1.55, 0.02, 2.31, -0.73, 0.94]\]
Step 2: Top-2 selection. The two highest scores are at zero-indexed positions 5 (score 2.31) and 0 (score 1.23). All other positions are set to \(-\infty\).
Step 3: Softmax normalization. Over the two remaining scores:
\[G(x)_5 = \frac{e^{2.31}}{e^{2.31} + e^{1.23}} = \frac{10.07}{10.07 + 3.42} = 0.746\]
\[G(x)_0 = \frac{e^{1.23}}{e^{2.31} + e^{1.23}} = \frac{3.42}{10.07 + 3.42} = 0.254\]
Step 4: Expert computation. Expert 5 and Expert 0 each independently process \(x\) through their own FFN (in Mixtral, a gated SwiGLU block with three weight matrices), producing \(E_5(x)\) and \(E_0(x)\), both in \(\mathbb{R}^{4096}\).
Step 5: Weighted aggregation.
\[y = 0.746 \cdot E_5(x) + 0.254 \cdot E_0(x)\]
Step 6: Residual connection. The final output is \(y + x\), fed to the next layer.
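The routing arithmetic in the steps above can be replayed directly. This snippet only reproduces the selection and softmax for the eight example scores; the expert FFNs themselves are not modelled.

```python
import math

scores = [1.23, -0.41, 0.87, -1.55, 0.02, 2.31, -0.73, 0.94]

# Step 2: indices of the two largest scores (zero-indexed)
top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]

# Step 3: softmax restricted to the selected scores
peak = scores[top2[0]]
exps = [math.exp(scores[i] - peak) for i in top2]
gates = {i: e / sum(exps) for i, e in zip(top2, exps)}

for i in top2:
    print(f"expert {i}: gate weight {gates[i]:.3f}")
```

The selected indices are 5 and 0, and the gate weights round to 0.746 and 0.254, matching the hand calculation.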
The key observation: 6 of the 8 expert FFNs were never touched. Their weights contributed zero FLOPs to this token. For a model with 47B total parameters where each expert FFN accounts for roughly 5.6B parameters, only about 11.2B parameters of FFN weights were used (plus the shared attention layers). The compute matches a ~13B dense model; the capacity matches a 47B model.
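The "roughly 5.6B per expert" figure can be reconstructed from the layer shapes. The configuration values below (hidden size 4096, FFN size 14336, 32 layers, three weight matrices per SwiGLU expert) are taken from Mixtral 8x7B's published model configuration rather than from this article, so treat them as assumptions of the sketch. Attention, embedding, router, and normalisation weights are left out.

```python
# Assumed Mixtral 8x7B configuration (from its published config, not from this article)
D_MODEL = 4096
D_FF = 14336
LAYERS = 32
EXPERTS = 8
TOP_K = 2
MATRICES_PER_EXPERT = 3  # SwiGLU FFN: gate, up, and down projections

per_expert_per_layer = MATRICES_PER_EXPERT * D_MODEL * D_FF
per_expert = per_expert_per_layer * LAYERS   # one expert index summed over all layers
all_experts = per_expert * EXPERTS           # expert weights that must be stored
active_experts = per_expert * TOP_K          # expert weights used by a single token

print(f"per expert:     {per_expert / 1e9:.2f}B")
print(f"all experts:    {all_experts / 1e9:.2f}B")
print(f"active experts: {active_experts / 1e9:.2f}B")
```

Under these assumptions each expert holds about 5.64B parameters across the 32 layers, so two active experts account for about 11.3B (the 11.2B quoted above is 2 x 5.6B after rounding), and all eight account for about 45B of the 47B total.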
[IMAGE: Diagram tracing a single token through the MoE layer with numerical annotations at each step. Shows the router scores as a bar chart, the top-2 selection, the softmax gate weights, and the weighted combination of two expert outputs merging back into the residual stream. Caption: "Complete forward pass of one token through a top-2 MoE layer. Six experts are never activated."]
Where It Breaks
Expert Collapse
The most common failure mode in MoE training. The router discovers that a few experts are slightly better early in training and begins routing more tokens to them. These experts receive more gradient updates, improve faster, and attract even more tokens. The positive feedback loop drives most tokens to 2-3 experts while the remaining experts receive negligible training signal and become "dead" parameters. In severe cases, a 64-expert model effectively degenerates into a 3-expert model.
Load balancing losses mitigate this, but they introduce a tension: the auxiliary loss gradient conflicts with the language modeling objective. Setting \(\alpha\) too high forces uniform routing regardless of token content, degrading model quality. Setting it too low allows collapse. DeepSeek-V3's auxiliary-loss-free approach sidesteps this by using a dynamic bias term on each expert's routing score, adjusted during training based on recent load, with a complementary sequence-level loss at an extremely small weight.
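The bias-based idea can be illustrated with a toy update rule. This is a simplified sketch, not DeepSeek's code: the fixed-step, sign-based update, the step size `gamma`, and the function names are assumptions chosen for clarity. As described in the DeepSeek-V3 report, the bias influences only which experts are selected; the gate weights are still computed from the unbiased scores.

```python
def select_top_k(scores, bias, k):
    """Pick experts by biased score; the bias affects selection only, not gate weights."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return ranked[:k]

def update_bias(bias, loads, gamma):
    """Nudge overloaded experts down and underloaded experts up by a fixed step."""
    mean = sum(loads) / len(loads)
    new_bias = []
    for b, load in zip(bias, loads):
        if load > mean:
            new_bias.append(b - gamma)
        elif load < mean:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias

scores = [1.0, 0.9, 0.5, 0.2]   # every token prefers expert 0 (a collapsed router)
bias = [0.0, 0.0, 0.0, 0.0]
before = select_top_k(scores, bias, k=1)
loads = [100, 0, 0, 0]          # token counts observed in the last step
bias = update_bias(bias, loads, gamma=0.25)
after = select_top_k(scores, bias, k=1)
```

After a single update the overloaded expert's bias drops and the runner-up wins the selection. No gradient from a balancing term ever touches the language modelling loss, which is the appeal of the approach; the cost is that `gamma` has to be tuned so the bias neither lags behind nor oscillates.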
Memory Overhead
A 47B-parameter Mixtral model activates only 13B parameters per token, but all 47B parameters must reside in memory (or be available for rapid loading). For DeepSeek-V3, the gap is starker: 671B parameters in memory, 37B active. In FP16, 671B parameters require approximately 1.3 TB of memory. Quantisation (INT8 or INT4) helps, but the memory footprint still far exceeds what the active compute would suggest. This is the fundamental MoE serving challenge: you pay compute costs proportional to active parameters but memory costs proportional to total parameters.
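The memory figures follow from simple multiplication of parameter count by bytes per parameter. The helper below is a back-of-the-envelope sketch that counts weights only; KV cache, activations, and framework overhead are deliberately ignored, and the byte widths per format are the usual nominal values.

```python
def weight_memory_gb(params_billions, bytes_per_param):
    """Weights only: ignores KV cache, activations, and framework overhead."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billions * bytes_per_param

BYTES = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

deepseek_v3 = {fmt: weight_memory_gb(671, b) for fmt, b in BYTES.items()}
mixtral = {fmt: weight_memory_gb(47, b) for fmt, b in BYTES.items()}
active_only_fp16 = weight_memory_gb(37, BYTES["fp16"])  # what compute alone would suggest

for fmt in BYTES:
    print(f"{fmt}: DeepSeek-V3 {deepseek_v3[fmt]:7.1f} GB, Mixtral 8x7B {mixtral[fmt]:5.1f} GB")
```

Even at INT4, DeepSeek-V3's weights occupy about 335 GB, several times the 74 GB that its 37B active parameters would need in FP16.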
Communication Overhead in Expert Parallelism
When experts are distributed across multiple GPUs (expert parallelism), every MoE layer requires an all-to-all communication step: tokens must be dispatched to whichever GPU holds their assigned expert, and results must be returned. This all-to-all pattern generates approximately 9x the communication volume of tensor parallelism for the same model (DeepSpeed-MoE, 2022). On NVLink interconnects (900 GB/s bidirectional), this is manageable. On PCIe Gen5 (~128 GB/s), it becomes the bottleneck.
Training Instability
Large router logits cause numerical overflow in softmax, leading to training spikes or NaN losses. The z-loss penalty (Zoph et al., 2022) constrains logit magnitudes, but tuning its coefficient relative to the main loss and the load balancing loss requires careful experimentation. MoE models are generally harder to train stably than dense models of equivalent quality, requiring more hyperparameter tuning and monitoring.
[IMAGE: Training loss curve comparison showing a dense model with smooth convergence versus an MoE model with periodic spikes and instabilities, annotated with labels pointing to "expert collapse recovery," "load rebalancing event," and "z-loss adjustment." Caption: "MoE training curves are typically noisier than dense equivalents. Spikes often correspond to routing instabilities that the auxiliary losses must correct."]
Fine-Tuning Brittleness
MoE models can be sensitive to fine-tuning. If only a subset of experts are relevant to the fine-tuning distribution, the router may collapse to those experts, losing the generalist capabilities stored in other experts. Techniques like freezing the router during fine-tuning, or using expert-level regularisation, are active areas of research.
Alternative Designs
| Design | Routing | Experts/Token | Load Balance | Key Advantage | Key Limitation |
|---|---|---|---|---|---|
| Dense Transformer | None | All parameters | N/A | Simple, stable training | Compute scales linearly with params |
| Top-K MoE (Shazeer 2017) | Token chooses top-k experts | Fixed k (usually 2) | Noise + auxiliary loss | Proven at scale, straightforward | Load imbalance, token dropping |
| Switch Transformer | Token chooses top-1 expert | 1 | Auxiliary loss | Maximum sparsity, simple routing | Expert underutilisation risk |
| Expert Choice (Zhou 2022) | Expert chooses top-k tokens | Variable per token | By construction | Perfect balance, no auxiliary loss | Variable compute per token, harder to batch |
| Fine-grained MoE (DeepSeek) | Token chooses top-k from many small experts | 6-8 of 160-256 | Bias-based, loss-free | Exponentially more expert combinations | Higher routing overhead |
| Hash MoE | Deterministic hash | 1-2 | By construction | No learned router, zero routing cost | Cannot learn content-aware routing |
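The expert choice row is easiest to understand by inverting the usual loop: instead of each token ranking experts, each expert ranks tokens. The sketch below is a minimal illustration with a hypothetical `expert_choice` function and hand-picked affinity scores; it shows why load is balanced by construction and why per-token compute becomes variable.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the router affinity of token t for expert e.
    Each expert takes its `capacity` highest-scoring tokens."""
    num_tokens, num_experts = len(scores), len(scores[0])
    picks = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        picks[e] = sorted(ranked[:capacity])
    # How many experts ended up processing each token
    experts_per_token = [
        sum(t in chosen for chosen in picks.values()) for t in range(num_tokens)
    ]
    return picks, experts_per_token

scores = [
    [0.9, 0.8],  # token 0: strong affinity for both experts
    [0.7, 0.1],
    [0.2, 0.6],
    [0.1, 0.2],  # token 3: weak affinity everywhere
]
picks, experts_per_token = expert_choice(scores, capacity=2)
```

Every expert processes exactly two tokens, but token 0 is handled by two experts while token 3 is handled by none, which is the "variable compute per token" limitation listed in the table.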
The trend is clearly toward fine-grained designs with more, smaller experts. DBRX's 16-choose-4 configuration provides 65x more expert combinations than Mixtral's 8-choose-2. DeepSeek-V3's 256-choose-8 takes this further. Increasing the granularity of experts improves the model's expressivity exponentially while keeping sparsity unchanged (Fine-grained MoE Scaling Laws, 2024).
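The combination counts are binomial coefficients and can be checked with the standard library's `math.comb` (Python 3.8+). The dictionary keys are just labels for this example.

```python
import math

combos = {
    "8-choose-2 (Mixtral)": math.comb(8, 2),
    "16-choose-4 (DBRX)": math.comb(16, 4),
    "256-choose-8 (DeepSeek-V3)": math.comb(256, 8),
}

dbrx_vs_mixtral = combos["16-choose-4 (DBRX)"] // combos["8-choose-2 (Mixtral)"]

for name, count in combos.items():
    print(f"{name:28s} {count:,}")
```

This gives 28, 1,820, and 409,663,695,276,000 (about \(4.1 \times 10^{14}\)) possible expert subsets respectively, and confirms the 65x ratio between DBRX and Mixtral.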
[IMAGE: Combinatorial explosion visualization showing the number of possible expert combinations: 8-choose-2 = 28, 16-choose-4 = 1,820, 256-choose-8 ≈ 4.1 × 10^14 (about 410 trillion). Each shown as a grid or lattice of dots with selected subsets highlighted. Caption: "Fine-grained expert routing exponentially increases the model's ability to compose different knowledge combinations per token."]
How It Is Used in Practice
Serving Architecture
Production MoE serving typically combines three parallelism strategies. Tensor parallelism splits individual expert FFNs across GPUs within a node (useful when a single expert is large). Expert parallelism distributes different experts to different GPUs, with all-to-all dispatch at each MoE layer. Data parallelism replicates the full model across nodes for throughput. The optimal mix depends on the ratio of expert size to interconnect bandwidth.
For Mixtral 8x7B, each expert FFN fits comfortably on a single GPU, so expert parallelism alone suffices with 8 GPUs per replica. For DeepSeek-V3 with 256 experts, hierarchical expert parallelism is necessary: experts are grouped across nodes, with intra-node NVLink for fast dispatch and inter-node InfiniBand for cross-group routing.
The Memory-Bandwidth Tradeoff
MoE inference is memory-bandwidth bound, not compute bound. During autoregressive generation, each token activates only a few experts, performing a small amount of compute relative to the data that must be read from memory. This means MoE models benefit less from GPU tensor cores and more from high-bandwidth memory (HBM). At batch size 1, only the selected experts' weights are read for each token. With even modest batching, however, the tokens in a batch collectively touch most of the experts, so the weight traffic per decoding step approaches that of a dense model the size of the total parameter count, while each expert still processes only a few tokens. At large batch sizes, every expert's weights are amortised over many tokens, the compute-to-memory ratio improves, and MoE inference becomes more efficient relative to dense models.
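How quickly batching spreads tokens across experts can be estimated with a simple probability model. The sketch assumes uniform, independent routing (each token picks \(k\) of \(N\) experts at random), which real routers do not satisfy exactly, so the numbers are indicative only. Function names are illustrative.

```python
def expected_experts_touched(num_experts, top_k, batch_size):
    """E[distinct experts hit] = N * (1 - (1 - k/N)^B) under uniform independent routing."""
    miss = 1.0 - top_k / num_experts  # chance a given expert is skipped by one token
    return num_experts * (1.0 - miss ** batch_size)

def tokens_per_touched_expert(num_experts, top_k, batch_size):
    """Average number of tokens sharing each expert whose weights had to be read."""
    touched = expected_experts_touched(num_experts, top_k, batch_size)
    return batch_size * top_k / touched

# Mixtral-style layer: 8 experts, top-2 routing
for batch in (1, 4, 16, 256):
    touched = expected_experts_touched(8, 2, batch)
    reuse = tokens_per_touched_expert(8, 2, batch)
    print(f"batch {batch:3d}: {touched:.2f} experts read, {reuse:.1f} tokens per expert")
```

Under this model a batch of 16 already reads nearly all 8 experts per layer while each expert serves only about 4 tokens; at a batch of 256 each weight read is shared by about 64 tokens, which is where the compute-to-memory ratio starts to look healthy.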
Expert Offloading
For deployment on consumer hardware or cost-sensitive settings, expert offloading keeps only the attention layers and router in GPU memory, storing expert weights in CPU RAM or even SSD. When the router selects experts for a token, the required weights are loaded on demand. This trades latency for memory: a 47B Mixtral model can run on a single 24GB GPU with expert offloading, at the cost of significant per-token latency. Prefetching strategies (predicting which experts the next token will need based on the current token's routing pattern) can partially hide this latency.
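The on-demand loading described here behaves like a small cache in front of slow storage. Below is a minimal sketch using a least-recently-used policy; the `ExpertCache` class, the LRU choice, and the string "weights" returned by the loader are all assumptions for illustration, and a real system would move tensors between devices and typically add prefetching.

```python
from collections import OrderedDict

class ExpertCache:
    """Keeps at most `capacity` experts resident; the least recently used is evicted first."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn  # hypothetical loader: expert_id -> weights
        self.resident = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return self.resident[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)  # slow path: CPU RAM or SSD to GPU
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used expert
        return weights

# GPU buffer with room for 2 experts; the router's choices arrive one at a time
cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [2, 5, 2, 5, 1, 2]:
    cache.get(expert_id)
```

For this access pattern the cache records 2 hits and 4 misses. Each miss stands for a transfer over PCIe, which is why routing locality and prefetching matter so much for offloaded inference.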
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart TB
subgraph GPU["GPU Memory (24 GB)"]
ATT["Attention layers + Router"]
BUF["Expert buffer (2 experts)"]
end
subgraph CPU["CPU RAM (64 GB)"]
E1["Expert 1 weights"]
E2["Expert 2 weights"]
E3["Expert 3 weights"]
E4["Experts 4-8 weights"]
end
ATT -->|"Router selects E2, E5"| BUF
E2 -->|"Load on demand"| BUF
E4 -->|"Load on demand"| BUF
BUF -->|"Compute, evict"| ATT
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
class ATT,BUF blue
class E1,E2,E3,E4 slate
[IMAGE: Performance graph showing tokens/second versus GPU count for Mixtral 8x7B inference, comparing expert parallelism (steep improvement from 1 to 8 GPUs) versus tensor parallelism (moderate improvement) versus expert offloading (flat but functional at 1 GPU). Caption: "Serving strategy dramatically affects MoE throughput. Expert parallelism scales near-linearly up to the expert count."]
Insights Worth Remembering
- The ratio matters more than the count. A model with 256 experts activating 8 is not inherently better than one with 8 experts activating 2. What matters is the ratio of total to active parameters, the granularity of expert combinations, and how well the router has learned to specialise them.
- MoE does not reduce memory requirements. It reduces compute requirements. This distinction is critical for inference deployment planning. Your GPU memory budget must accommodate total parameters, not active parameters.
- Load balancing is an unsolved optimisation problem. Every approach (auxiliary losses, z-loss, expert choice, bias-based balancing) involves tradeoffs between routing quality and utilisation uniformity. DeepSeek's loss-free approach is currently the most elegant, but it requires careful tuning of the bias update speed.
- Expert specialisation is emergent, not designed. Nobody programs Expert 3 to handle mathematics. The router and experts co-evolve during training, and the resulting specialisations are often surprising. Some experts specialise by syntax (handling punctuation and structure), others by domain (code, multilingual text), and some by position in the sequence.
- Top-1 routing works better than expected. The Switch Transformer showed that activating a single expert per token, the most extreme sparsity, still produces strong results. The intuition that "more active experts equals better" does not hold linearly.
- Fine-grained experts compound combinatorially. Going from 8-choose-2 (28 combinations) to 16-choose-4 (1,820 combinations) to 256-choose-8 (about \(4.1 \times 10^{14}\), or roughly 410 trillion combinations) gives the model exponentially more ways to compose knowledge for different inputs. This is a stronger lever than simply making each expert larger.
- MoE's training-compute advantage widens with scale. At \(10^{20}\) FLOPs, a compute-optimal MoE matches a dense model trained with roughly 20x the compute. At \(10^{25}\) FLOPs, the savings exceed 40x. The larger the compute budget, the more MoE dominates.
- Shared experts are surprisingly important. DeepSeek's architecture dedicates one expert as "always active" across all tokens, learning common linguistic patterns. This frees the routed experts to specialise more aggressively, improving overall quality.
- The $5.6M training cost for DeepSeek-V3 is misleading in isolation. The research cost (failed experiments, architecture search, infrastructure development) is much higher. The $5.6M figure represents the final successful training run on a known-good configuration. It is still remarkable, as it demonstrates that MoE can achieve frontier quality at a fraction of the compute cost of dense alternatives.
- Expert parallelism communication cost is the binding constraint at scale. The all-to-all dispatch pattern at every MoE layer generates 9x more communication than tensor parallelism. Hierarchical expert parallelism, topology-aware routing, and prefetching are active engineering frontiers.
Open Questions
Can MoE models be distilled into dense models without losing quality? If the specialised knowledge in 256 experts can be compressed into a single dense network with acceptable quality loss, it would give the best of both worlds: MoE training efficiency and dense inference simplicity. Early results suggest 5-15% quality degradation, which may or may not be acceptable depending on the application.
Will expert routing become content-addressable? Current routers use a single linear projection. Richer routing (attention-based, multi-step, or retrieval-augmented) could improve expert selection at the cost of routing overhead. The optimal complexity of the router itself is unknown.
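For reference, the single-linear-projection baseline that richer routers would replace can be sketched in plain Python. This is an illustrative sketch, not any particular model's router: the function name `route`, the random weights, and the renormalisation of the kept gate values are assumptions made here, and production routers differ in details such as the scoring function.

```python
import math
import random

def route(token, router_weights, top_k):
    """Score every expert with one linear projection, then keep the top_k.

    token: list of d floats. router_weights: one length-d row per expert.
    Returns (expert_ids, gates), with gates renormalised to sum to 1.
    """
    # One dot product per expert: this is the entire "router network".
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Numerically stable softmax over all experts.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k experts by probability, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    gates = [probs[i] / kept for i in chosen]
    return chosen, gates

random.seed(0)
d_model, num_experts, top_k = 16, 8, 2  # toy sizes, chosen for illustration
weights = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_experts)]
token = [random.gauss(0, 1) for _ in range(d_model)]
experts, gates = route(token, weights, top_k)
print(experts, [round(g, 3) for g in gates])
```

The router costs one small matrix-vector product per token, so any attention-based or multi-step alternative has to justify its extra overhead against that baseline.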
Is there a natural limit to expert count? DeepSeek-V3 uses 256 routed experts. The combinatorial advantage of finer granularity suggests going higher, but diminishing returns, routing overhead, and memory constraints impose practical limits. Where that limit falls, and whether it shifts with hardware improvements, remains to be determined.
How should MoE models be fine-tuned? Full fine-tuning risks expert collapse. LoRA on all experts is expensive in adapter count. Routing-aware fine-tuning (adjusting only the experts that the router selects for the fine-tuning distribution) is promising but underexplored. The community has not converged on a best practice.
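One way to picture routing-aware fine-tuning is as a selection step: run a sample of the fine-tuning data through the frozen router, count which experts it picks, and unfreeze only the most-used ones. The sketch below is a hypothetical heuristic, not an established recipe. The helper `experts_to_tune`, the 90% coverage threshold, and the toy routing sample are all assumptions made for illustration, and a real model would apply this separately at each MoE layer.

```python
from collections import Counter

def experts_to_tune(routing_decisions, coverage=0.9):
    """Pick the most-used experts until `coverage` of all assignments is covered.

    routing_decisions: iterable of per-token lists of selected expert ids.
    Returns the chosen expert ids in ascending order.
    """
    counts = Counter(e for token_experts in routing_decisions for e in token_experts)
    total = sum(counts.values())
    selected, covered = [], 0
    # Walk experts from most to least used, stopping once coverage is reached.
    for expert_id, n in counts.most_common():
        if covered >= coverage * total:
            break
        selected.append(expert_id)
        covered += n
    return sorted(selected)

# Toy top-2 routing decisions for ten fine-tuning tokens (made-up data).
sample = [[0, 3], [0, 3], [0, 5], [3, 5], [0, 3],
          [1, 3], [0, 3], [0, 5], [0, 3], [3, 5]]
print(experts_to_tune(sample, coverage=0.9))  # -> [0, 3, 5]
```

In this toy sample expert 1 handles a single assignment and stays frozen, which is the intended effect: adapter or gradient budget goes only to the experts the fine-tuning distribution exercises.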
Will hardware co-design change the tradeoffs? Custom silicon optimised for sparse all-to-all communication (rather than dense matrix multiplication) could shift the balance further toward MoE architectures. Google's TPU interconnect topology already favours MoE workloads; similar specialisation in GPU clusters may follow.
Sources and Further Reading
-
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). "Adaptive Mixtures of Local Experts." Neural Computation, 3(1), 79-87.
-
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." arXiv:1701.06538
-
Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2020). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." arXiv:2006.16668
-
Fedus, W., Zoph, B., & Shazeer, N. (2021). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR, 23(120), 1-39. arXiv:2101.03961
-
Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., & Fedus, W. (2022). "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906
-
Zhou, Y., Lei, T., Liu, H., Du, N., Huang, Y., Zhao, V., Dai, A., Chen, Z., Le, Q., & Laudon, J. (2022). "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022. arXiv:2202.09368
-
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., de las Casas, D., Hanna, E. B., Bressand, F., et al. (2024). "Mixtral of Experts." arXiv:2401.04088
-
Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. (2024). "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models." ACL 2024. arXiv:2401.06066
-
DeepSeek-AI. (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434
-
DeepSeek-AI. (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437
-
Databricks. (2024). "Introducing DBRX: A New State-of-the-Art Open LLM." Databricks Blog
-
xAI. (2024). "Open Release of Grok-1." GitHub
-
Wang, L., Gao, H., Zhao, C., Sun, X., & Dai, D. (2024). "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts." arXiv:2408.15664
-
Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., & He, Y. (2022). "DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale." ICML 2022.
-
Yun, S., Krishnamurthy, S., & Hestness, J. (2024). "Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models." arXiv:2410.05661