Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 1d

DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training

Researchers have introduced DTop-p MoE, a routing mechanism for sparse Mixture-of-Experts (MoE) architectures used in foundation model pre-training. The method uses a Proportional-Integral (PI) controller to adjust the Top-p probability threshold dynamically, and combines this with layer-wise expert selection under a global sparsity constraint. In the reported experiments, DTop-p MoE outperforms standard Top-k and fixed Top-p routing on Large Language Models and Diffusion Transformers at comparable computational cost.
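To make the idea concrete, here is a minimal sketch of Top-p routing whose threshold is steered by a PI controller toward a target sparsity. It is an illustration under stated assumptions, not the paper's formulation: the names `top_p_select`, `PIController`, and `run_demo`, the gains `kp` and `ki`, the target of 2 active experts per token, and the random router logits are all hypothetical, and the layer-wise part of the method is not modeled.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(probs, p):
    """Smallest set of experts, taken in descending probability,
    whose cumulative router probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cum = [], 0.0
    for i in order:
        chosen.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return chosen


class PIController:
    """Hypothetical PI loop: steers p so the mean number of active
    experts per token approaches `target`. Gains are illustrative."""

    def __init__(self, target, kp=0.01, ki=0.01, p_init=0.9,
                 p_min=0.05, p_max=0.99):
        self.target, self.kp, self.ki = target, kp, ki
        self.p_init, self.p_min, self.p_max = p_init, p_min, p_max
        self.integral = 0.0
        self.p = p_init

    def update(self, observed):
        error = self.target - observed  # > 0 means too few experts are active
        self.integral += error          # accumulated error (the "I" term)
        raw = self.p_init + self.kp * error + self.ki * self.integral
        self.p = min(self.p_max, max(self.p_min, raw))  # keep p in a valid range
        return self.p


def run_demo(steps=300, tokens=256, experts=8, target=2.0, seed=0):
    """Run the loop on a fixed batch of synthetic router outputs."""
    rng = random.Random(seed)
    batch = [softmax([rng.gauss(0.0, 1.0) for _ in range(experts)])
             for _ in range(tokens)]
    ctrl = PIController(target)
    avg = 0.0
    for _ in range(steps):
        # Measure sparsity at the current threshold, then adjust the threshold.
        avg = sum(len(top_p_select(pr, ctrl.p)) for pr in batch) / tokens
        ctrl.update(avg)
    return ctrl.p, avg
```

Calling `run_demo()` returns the final threshold and the last measured average number of active experts per token. The point of the sketch is the contrast with fixed Top-p: with a fixed threshold the compute per token drifts with the router's confidence, whereas a feedback loop of this kind holds average sparsity near a chosen budget while still letting individual tokens use more or fewer experts.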

IMPACT Introduces an MoE routing mechanism that reportedly improves model performance at comparable compute, which could help pre-training of large models scale.

Mixture-of-Experts
Large Language Models
Diffusion Transformers
Can Jin
DTop-p MoE