Brief · PulseAugur

RESEARCH · arXiv cs.AI English(EN) · 7h · [2 sources]

Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

Researchers have developed Muon$^2$, an enhanced version of the Muon optimizer designed for large-scale foundation model pre-training. Muon$^2$ improves efficiency and quality by applying Adam-style adaptive second-moment preconditioning before orthogonalization, which targets the computational cost of Muon's iterative orthogonalization process. In experiments with GPT, LLaMA, and Mixture-of-Experts models up to 13B parameters, Muon$^2$ reduces the number of Newton-Schulz iterations needed by 40% and saves up to a quarter of training time compared with Muon while achieving similar loss.
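To make the ordering described above concrete (second-moment preconditioning first, orthogonalization second), here is a minimal NumPy sketch. It is not the paper's algorithm: the brief does not give the update rule, so the function names `newton_schulz` and `muon2_like_step`, the state layout, the absence of bias correction, and all hyperparameter values are assumptions made for illustration. The orthogonalization uses the classic cubic Newton-Schulz iteration rather than any tuned polynomial a production Muon implementation might use.

```python
import numpy as np


def newton_schulz(mat, steps=5):
    """Approximately orthogonalize `mat` with the classic cubic Newton-Schulz iteration.

    Assumes mat has shape (rows, cols) with rows <= cols and full row rank.
    """
    # Dividing by the Frobenius norm keeps every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    x = mat / (np.linalg.norm(mat) + 1e-12)
    for _ in range(steps):
        # Each step pushes the singular values of x toward 1.
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


def muon2_like_step(weight, grad, state, lr=0.02, beta1=0.95, beta2=0.99,
                    eps=1e-8, ns_steps=3):
    """One hypothetical update: Adam-style scaling, then orthogonalization.

    All hyperparameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    # First-moment (momentum) estimate of the gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    # Second-moment estimate, element-wise as in Adam (no bias correction here).
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Precondition BEFORE orthogonalization, the ordering the summary describes.
    precond = state["m"] / (np.sqrt(state["v"]) + eps)
    # Fewer Newton-Schulz steps is the kind of saving the summary reports.
    update = newton_schulz(precond, steps=ns_steps)
    return weight - lr * update
```

A caller would initialize `state` as `{"m": np.zeros_like(w), "v": np.zeros_like(w)}` for each weight matrix `w` and call `muon2_like_step` once per gradient. The intuition this sketch illustrates, under the assumptions above, is that element-wise rescaling can hand the orthogonalization a better-conditioned input, so fewer iterations may suffice.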

IMPACT Muon$^2$ offers a more efficient training process for large foundation models, potentially reducing computational costs and accelerating development cycles.

Mixture-of-Experts
GPT
LLaMA
Muon
Naoki Sato
Ziyue Liu
Muon$^2$