A new theory, the Norm-Separation Delay Law, explains grokking, the phenomenon in which models generalize long after memorizing their training data. The researchers show that grokking is a representational phase transition driven by parameter norms, and derive a mathematical relationship between the delay and factors such as weight decay and learning rate. The work reframes grokking as a predictable consequence of norm separation and provides a predictive algorithm for estimating grokking delay.
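The reported inverse dependence of the delay on weight decay and learning rate can be illustrated with a toy calculation. This is only a sketch under the assumption that the relevant weight norm shrinks geometrically at rate lr·λ per step; the function name and the exact scaling are illustrative, not the paper's actual algorithm:

```python
def steps_to_norm_threshold(w0_norm, lr, weight_decay, threshold):
    """Count optimizer steps until a weight norm falls below `threshold`
    under pure weight decay: ||w_{t+1}|| = (1 - lr * weight_decay) * ||w_t||.
    Toy model only; not the method from the summarized papers."""
    decay = 1.0 - lr * weight_decay
    steps = 0
    norm = w0_norm
    while norm > threshold:
        norm *= decay
        steps += 1
    return steps

# Halving lr * weight_decay roughly doubles the delay, consistent with
# a delay that scales like 1 / (lr * weight_decay) in this toy model.
fast = steps_to_norm_threshold(10.0, lr=0.1, weight_decay=0.1, threshold=1.0)
slow = steps_to_norm_threshold(10.0, lr=0.1, weight_decay=0.05, threshold=1.0)
```

In this simplified picture, the "grokking delay" is just the time for the norm to decay past a separation threshold, which is why weaker regularization or a smaller learning rate stretches the delay out.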
Summary compiled from 2 sources.
IMPACT Provides a theoretical framework for understanding and predicting model generalization delays, potentially enabling more efficient training.
RANK_REASON The cluster contains two arXiv papers presenting theoretical analyses of machine learning optimization algorithms and phenomena.