Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection
Published at arXiv, 2024

Abstract
Mixture-of-Experts (MoE) architectures have recently gained popularity as a way to scale large language
models efficiently. However, we uncover a fundamental tension: while MoEs are designed for selective expert
activation, production serving requires request batching, which forces the activation of all experts and negates
MoE’s efficiency benefits during the decode phase. We present LYNX, a system that enables efficient MoE
inference through dynamic, batch-aware expert selection. Our key insight is that expert importance varies
significantly across tokens and inference phases, creating opportunities for runtime optimization. LYNX leverages
this insight through a lightweight framework that dynamically reduces active experts while preserving model
accuracy. Our evaluations show that LYNX achieves up to a 1.55× reduction in inference latency while incurring
negligible accuracy loss relative to the baseline model across complex code generation and mathematical reasoning tasks.
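
The batching tension described in the abstract can be illustrated with a minimal sketch. It is not the LYNX method or a measurement from the paper. It assumes each token independently routes to `top_k` of `num_experts` experts chosen uniformly at random, which real routers do not do exactly, and the function names and default values (8 experts, top-2 routing) are hypothetical. Under that assumption, the sketch counts how many distinct experts a single decode step touches as the batch grows.

```python
import random


def active_experts(batch_size, num_experts=8, top_k=2, seed=0):
    """Simulate one decode step and count the distinct experts activated.

    Assumption: every token picks top_k distinct experts uniformly at random.
    """
    rng = random.Random(seed)
    used = set()
    for _ in range(batch_size):
        # One token's routing decision: top_k distinct expert indices.
        used.update(rng.sample(range(num_experts), top_k))
    return len(used)


def expected_active(batch_size, num_experts=8, top_k=2):
    """Expected number of distinct experts under the same uniform assumption.

    A given expert is skipped by one token with probability 1 - top_k/num_experts,
    and by the whole batch with that probability raised to batch_size.
    """
    miss_one_token = 1 - top_k / num_experts
    return num_experts * (1 - miss_one_token ** batch_size)


if __name__ == "__main__":
    for b in (1, 2, 4, 8, 16, 32):
        print(b, active_experts(b), round(expected_active(b), 2))
```

Under this uniform-routing assumption, a single token activates 2 of 8 experts, while a batch of 16 tokens already touches about 7.9 of the 8 experts in expectation, so nearly every expert's weights must be loaded for the decode step.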