PulseAugur

Muon optimizer shows superior feature learning over Adam

Two research papers examine why the Muon optimizer outperforms Adam when training large language models and vision classifiers. One reports that Muon learns more robust and transferable features, performing better on corrupted data and transferring better to downstream tasks. The other attributes Muon's advantage in large language model training to reduced curvature penalties from lower normalized directional sharpness, particularly in the middle and late stages of training, an effect amplified by data imbalance and heterogeneous curvature.

IMPACT Muon's demonstrated ability to learn more robust and transferable features could lead to more efficient and effective training of future large language models and AI systems.

RANK_REASON The cluster contains academic papers detailing novel research findings on AI model optimization.


AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.AI TIER_1 English(EN) · Tianyu Ruan, Fengzhuo Zhang, Shuche Wang, Shihua Zhang

    Muon Learns More Robust and Transferable Features than Adam

    arXiv:2606.09658v1. Abstract: Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains uncl…

  2. arXiv cs.AI TIER_1 English(EN) · Shihua Zhang

    Muon Learns More Robust and Transferable Features than Adam

    Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains unclear. This paper investigates Muon's feature-learni…

  3. Hugging Face Daily Papers TIER_1 English(EN)

    Why Muon Outperforms Adam: A Curvature Perspective

    Muon outperforms Adam in large language model training by reducing curvature penalties through lower normalized directional sharpness, particularly in middle and late training stages, with advantages amplified by data imbalance and heterogeneous curvature.
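
For readers unfamiliar with the curvature terms used above, here is a minimal sketch of how directional sharpness and the resulting curvature penalty can be computed on a toy quadratic loss. It assumes "normalized directional sharpness" means the Rayleigh quotient d^T H d / ||d||^2 of the Hessian H along an update direction d; the paper's exact definition may differ. The function names, the toy Hessian, and the two directions compared are hypothetical and chosen only for illustration. Neither direction is the actual Muon or Adam update.

```python
import numpy as np


def directional_sharpness(hessian, direction):
    """Curvature along `direction`: d^T H d / ||d||^2 (assumed definition)."""
    d = np.asarray(direction, dtype=float)
    return float(d @ hessian @ d) / float(d @ d)


def curvature_penalty(hessian, direction, lr):
    """Second-order Taylor term for a step of length `lr` along `direction`.

    For w_new = w - lr * d / ||d||, the loss changes by roughly
    -lr * g^T d / ||d||  +  0.5 * lr**2 * sharpness,
    and this function returns only the second (penalty) term.
    """
    return 0.5 * lr**2 * directional_sharpness(hessian, direction)


# Toy quadratic loss 0.5 * w^T H w with heterogeneous curvature (illustrative values).
H = np.diag([100.0, 1.0])
w = np.array([1.0, 1.0])
g = H @ w  # gradient of the quadratic at w

d_grad = g           # raw gradient direction, dominated by the sharp axis
d_sign = np.sign(g)  # equal-magnitude direction, a crude stand-in for a normalized update

s_grad = directional_sharpness(H, d_grad)
s_sign = directional_sharpness(H, d_sign)
p_grad = curvature_penalty(H, d_grad, lr=0.01)
p_sign = curvature_penalty(H, d_sign, lr=0.01)

print(f"sharpness along gradient:      {s_grad:.3f}")
print(f"sharpness along sign direction: {s_sign:.3f}")
print(f"curvature penalties: {p_grad:.6f} vs {p_sign:.6f}")
```

In this toy setting the direction that spreads its step evenly across coordinates sees lower sharpness, and therefore a smaller curvature penalty for the same step length, than the raw gradient direction. That is only an illustration of the quantity being measured, not a reproduction of the papers' experiments.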