DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training
Researchers have introduced DTop-p MoE, a routing mechanism for sparse Mixture-of-Experts (MoE) architectures used in foundation model pre-training. The method uses a Proportional-Integral (PI) controller to adjust the Top-p probability threshold dynamically, and it performs layer-wise expert selection under a global sparsity constraint. In the reported experiments, DTop-p MoE outperforms standard Top-k and fixed Top-p routing in Large Language Models and Diffusion Transformers at comparable computational cost.
AI IMPACT: Introduces a more efficient routing mechanism for MoE architectures, potentially improving training scalability and performance for large models.
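To make the routing idea concrete, here is a minimal sketch in plain Python of Top-p expert selection with a PI-controlled threshold. It is an illustration under stated assumptions, not the paper's implementation: the names `top_p_select`, `PIController`, and `simulate`, the gains `kp` and `ki`, the clamping range, the random router logits, and the choice to express the global sparsity constraint as a target mean number of active experts per token across all layers are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(probs, p):
    """Indices of the smallest expert set whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen


class PIController:
    """Hypothetical PI controller steering the Top-p threshold toward a sparsity target."""

    def __init__(self, target, p_init=0.5, kp=0.02, ki=0.005, p_min=0.05, p_max=0.95):
        self.target = target        # desired mean number of active experts per token
        self.p_init = p_init
        self.kp, self.ki = kp, ki   # assumed gains, not taken from the paper
        self.p_min, self.p_max = p_min, p_max
        self.integral = 0.0
        self.p = p_init

    def update(self, observed):
        # Too many active experts gives a negative error, which lowers the threshold.
        error = self.target - observed
        self.integral += error
        raw = self.p_init + self.kp * error + self.ki * self.integral
        self.p = min(self.p_max, max(self.p_min, raw))
        return self.p


def simulate(steps=200, tokens=64, num_experts=8, layer_scales=(0.5, 1.0, 2.0),
             target=2.0, seed=0):
    """Route random tokens through several layers that share one controlled threshold."""
    rng = random.Random(seed)
    controller = PIController(target=target)
    mean_active = 0.0
    for _step in range(steps):
        active = 0
        for scale in layer_scales:          # each layer has its own logit spread
            for _token in range(tokens):
                logits = [rng.gauss(0.0, scale) for _ in range(num_experts)]
                active += len(top_p_select(softmax(logits), controller.p))
        # The sparsity measurement is global: averaged over all layers and tokens.
        mean_active = active / (tokens * len(layer_scales))
        controller.update(mean_active)
    return controller.p, mean_active


if __name__ == "__main__":
    final_p, final_mean = simulate()
    print(f"threshold={final_p:.3f}, mean active experts={final_mean:.2f}")
```

In this sketch each layer can activate a different number of experts for the same threshold, because layers with sharper router distributions reach the cumulative probability `p` sooner, while the controller only sees the average over all layers. That is one simple way to read "layer-wise expert selection under a global sparsity constraint"; the actual method may couple layers and thresholds differently.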