Grouter: Decoupling Routing from Representation for Accelerated MoE Training
Researchers have introduced Grouter, a method for training Mixture-of-Experts (MoE) models that decouples the routing policy from the expert weights. Rather than learning the router jointly with the experts, Grouter uses a fixed router derived from pre-trained MoE models, which is reported to accelerate convergence and improve training stability. Grouter also incorporates expert folding and tuning to adapt the router to different model configurations and data distributions, with reported gains in pre-training data utilization and training throughput.
AI IMPACT: Accelerates MoE training and improves data utilization, potentially lowering costs for large model development.
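The decoupling idea can be illustrated with a toy sketch. This is not Grouter's implementation: the names `FROZEN_ROUTER`, `route`, `moe_forward`, and `train_step` are hypothetical, each "expert" is reduced to a single scalar, and expert folding and tuning are not modeled. The sketch only shows the general pattern of a fixed top-k router whose weights are never updated while the expert parameters are trained.

```python
import math

# Hypothetical frozen router: one weight row per expert. Stored as a tuple
# so the training step below cannot modify it.
FROZEN_ROUTER = ((1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.5, 0.5))


def route(token, router_weights, k=2):
    """Return the top-k (expert_index, gate) pairs for one token."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, shifted for numerical stability.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(token, router_weights, expert_scales, k=2):
    """Toy MoE layer: expert i multiplies the token by expert_scales[i]."""
    picks = route(token, router_weights, k)
    mix = sum(gate * expert_scales[i] for i, gate in picks)
    return [mix * x for x in token]


def train_step(token, target, router_weights, expert_scales, lr=0.1, k=2):
    """One SGD step on 0.5 * squared error; only expert parameters change."""
    picks = route(token, router_weights, k)
    out = moe_forward(token, router_weights, expert_scales, k)
    # d(loss)/d(expert_scales[i]) = gate_i * sum_j (out_j - target_j) * x_j
    err = sum((o - t) * x for o, t, x in zip(out, target, token))
    for i, gate in picks:
        expert_scales[i] -= lr * gate * err
    return expert_scales


if __name__ == "__main__":
    token, target = [2.0, 1.0], [4.0, 2.0]
    scales = [1.0, 1.0, 1.0, 1.0]
    before = route(token, FROZEN_ROUTER)
    train_step(token, target, FROZEN_ROUTER, scales)
    after = route(token, FROZEN_ROUTER)
    print("routing unchanged:", before == after)
    print("expert scales after one step:", scales)
```

Because the router is held fixed, the token-to-expert assignment is the same before and after the update, so the experts train against a stable assignment; in this toy setup only the experts selected for the token receive a gradient.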