The Core Insight: Conditional Computation
Dense neural networks activate all their parameters for every input token. MoE introduces sparsity: only a subset of specialized expert networks activates for each token, chosen by a learned router network.
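To make the routing step concrete, here is a minimal sketch in plain Python. It assumes one common variant: the router produces one score per expert, the top-k scores are kept, and a softmax over just those k scores gives the mixing weights. The names `route_top_k` and `moe_layer`, and the toy experts that simply scale their input, are illustrative assumptions rather than any particular model's implementation.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Normalise over the selected experts only, so the weights sum to 1.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, experts, router_weights, k=2):
    """Run only the k selected experts and mix their outputs."""
    # Router: a single linear layer giving one score per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    out = [0.0] * len(token)
    for idx, weight in route_top_k(logits, k):
        expert_out = experts[idx](token)  # the other experts are never called
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy setup (hypothetical): three "experts" that scale the input by 1, 2, 3.
toy_experts = [lambda v, s=s: [s * x for x in v] for s in (1, 2, 3)]
toy_router = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
toy_output = moe_layer([1.0, 0.0], toy_experts, toy_router, k=2)
```

In this toy setup the router scores are 0, 2, and 1, so the second and third experts are selected and the first is skipped entirely, which is where the compute saving comes from.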
The Math of Efficiency in 2026
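In a MoE model with 8 experts and top-2 routing, each token passes through 2 of the 8 experts, so 25% of the expert parameters are active per token. Put differently, the expert layers can hold 4x the parameters of an equivalent dense layer at the same per-token compute. For the model as a whole the active share is somewhat above 25%, because attention layers and embeddings are shared and run for every token.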
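The arithmetic can be checked with a few lines of Python. This is a sketch under simple assumptions: every expert has the same size, and `shared_params` stands for whatever runs for every token (attention, embeddings). The function name and the parameter counts in the second call are hypothetical, chosen only for illustration.

```python
def active_fraction(n_experts, top_k, expert_params, shared_params=0.0):
    """Return (active, total, active / total) parameter counts per token."""
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params
    return active, total, active / total

# Experts only: 2 of 8 experts active gives exactly 25%.
experts_only = active_fraction(n_experts=8, top_k=2, expert_params=1.0)

# With shared parameters (hypothetical sizes, in billions),
# the active share rises above 25%.
with_shared = active_fraction(n_experts=8, top_k=2,
                              expert_params=5.0, shared_params=2.0)
```

The second call illustrates why a whole-model figure is usually a little higher than the expert-only ratio: shared layers count toward the active total for every token.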
Mistral's Mixtral: Proving the Concept at Scale
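Mixtral 8x7B showed that MoE could match or outperform a dense 70B model (Llama 2 70B) on most benchmarks while activating only about 13B of its roughly 47B parameters per token. The name is slightly misleading: the experts share attention layers, so the total is about 47B rather than 56B, and top-2 routing puts the per-token cost closer to a 13B dense model than a 7B one. Efforts to apply MoE at progressively larger scales followed quickly across major AI labs.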
"MoE isn't a trick — it's a fundamentally better way to scale intelligence efficiently. We expect it to dominate the frontier architecture space through 2027 and beyond."
Frequently Asked Questions
What is mixture of experts in AI?
Mixture of Experts (MoE) uses different subnetworks (experts) that specialise in different inputs. A gating network routes each input to only a subset of experts, enabling much larger total parameters while keeping inference compute similar to a smaller dense model.
Does GPT-4 use mixture of experts?
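OpenAI has not disclosed GPT-4's architecture. Unconfirmed third-party reports describe a MoE design with approximately 1.8 trillion total parameters across 16 expert groups, activating roughly 220 billion parameters per forward pass. These figures vary between sources and should be treated as rumor rather than specification.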
What is the advantage of MoE over dense models?
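MoE achieves better performance per unit of active compute via specialisation. MoE models can have many times more total parameters than dense models (on the order of 10× in sparser designs) while using similar inference FLOPs per token. The tradeoff is higher memory requirements, since every expert must stay loaded even though only a few run per token, and more complex training.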
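The memory-versus-compute tradeoff can be sketched with two back-of-the-envelope formulas. The assumptions are loud ones: weights stored at 2 bytes per parameter (16-bit), and the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. The model sizes are hypothetical and ignore activations and the KV cache.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in gigabytes."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: ~2 FLOPs per active parameter per token."""
    return 2 * active_params

# Hypothetical pair: a 10B dense model and a MoE with 10x the total
# parameters but the same 10B active per token.
dense = {"total": 10e9, "active": 10e9}
moe = {"total": 100e9, "active": 10e9}

dense_mem = weight_memory_gb(dense["total"])
moe_mem = weight_memory_gb(moe["total"])
dense_flops = forward_flops_per_token(dense["active"])
moe_flops = forward_flops_per_token(moe["active"])
```

Under these assumptions the two models cost the same FLOPs per token, but the MoE needs ten times the weight memory, which is why serving MoE models tends to be limited by memory capacity and bandwidth before compute.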