Researchers have developed Muon$^2$, an extension of the Muon optimizer for large-scale foundation model pre-training. Muon$^2$ improves both efficiency and quality by applying Adam-style adaptive second-moment preconditioning before orthogonalization, which reduces the computational cost of Muon's iterative orthogonalization step. In experiments with GPT, LLaMA, and Mixture-of-Experts models of up to 13B parameters, Muon$^2$ needs 40% fewer Newton-Schulz iterations and saves up to a quarter of training time relative to Muon while reaching similar loss.
AI IMPACT Muon$^2$ makes training large foundation models more efficient, potentially reducing computational costs and accelerating development cycles.
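
The idea of preconditioning before orthogonalization can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `newton_schulz` and `preconditioned_orthogonal_step`, the hyperparameter defaults, the omission of bias correction, and the use of the classic cubic Newton-Schulz iteration (rather than whatever tuned variant the papers use) are all assumptions made for illustration.

```python
import numpy as np


def newton_schulz(G, steps=5):
    """Approximate the orthogonal polar factor of G.

    Uses the classic cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    This is an illustrative choice, not necessarily the variant used by Muon.
    """
    # Frobenius-norm scaling keeps every singular value in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X


def preconditioned_orthogonal_step(W, grad, m, v, lr=0.02, beta1=0.95,
                                   beta2=0.99, eps=1e-8, ns_steps=3):
    """One hypothetical update: Adam-style scaling, then orthogonalization.

    All defaults are placeholder values chosen for this sketch.
    """
    # First- and second-moment running averages, as in Adam
    # (bias correction is omitted to keep the sketch short).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Element-wise second-moment preconditioning applied BEFORE orthogonalization.
    P = m / (np.sqrt(v) + eps)
    # Orthogonalize the preconditioned momentum and take the step.
    U = newton_schulz(P, steps=ns_steps)
    return W - lr * U, m, v


if __name__ == "__main__":
    W = np.zeros((2, 3))
    grad = np.array([[3.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
    m, v = np.zeros_like(W), np.zeros_like(W)
    W, m, v = preconditioned_orthogonal_step(W, grad, m, v)
    print(W)
```

In this toy case the element-wise scaling equalizes the two nonzero gradient entries before the iteration starts, so the matrix handed to `newton_schulz` is already well conditioned. That is only meant to give intuition for why fewer iterations might suffice; the sketch makes no claim about the mechanism or settings reported in the papers.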
AI-generated summary · Google Gemini · from 2 sources.