A recent paper poses an open problem: does the AdamW optimizer remain effective when training large language models (LLMs) under heavy-tailed noise? Although AdamW is widely used, its theoretical analysis is limited to finite-variance settings, even though empirical evidence suggests heavy-tailed noise is common in LLM pretraining. The paper asks whether AdamW can converge in this regime, contrasts it with optimizers such as Lion and Muon that have been shown to converge under heavy-tailed noise, and provides a weighted-metric benchmark and a lower-bound mechanism.
IMPACT Clarifies theoretical limitations of a widely used LLM training optimizer, potentially guiding future research into more robust methods.
RANK_REASON The cluster contains an academic paper detailing an open problem in machine learning optimization.
AI-generated summary · Google Gemini · from 1 source.