A new theory, the Norm-Separation Delay Law, explains grokking, the phenomenon in which models generalize long after memorizing their training data. The researchers show that grokking is a representational phase transition driven by parameter norms, and derive a mathematical relationship between the delay and factors such as weight decay and learning rate. The work reframes grokking as a predictable consequence of norm separation and provides a predictive algorithm for estimating grokking delay.
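The reported inverse dependence of the delay on weight decay and learning rate can be illustrated with a toy calculation. This is only a sketch under the assumption that the relevant weight norm shrinks geometrically at rate lr·λ per step; the function name and the exact scaling are illustrative, not the paper's actual algorithm:

```python
def steps_to_norm_threshold(w0_norm, lr, weight_decay, threshold):
    """Count optimizer steps until a weight norm falls below `threshold`
    under pure weight decay: ||w_{t+1}|| = (1 - lr * weight_decay) * ||w_t||.
    Toy model only; not the method from the summarized papers."""
    decay = 1.0 - lr * weight_decay
    steps = 0
    norm = w0_norm
    while norm > threshold:
        norm *= decay
        steps += 1
    return steps

# Halving lr * weight_decay roughly doubles the delay, consistent with
# a delay that scales like 1 / (lr * weight_decay) in this toy model.
fast = steps_to_norm_threshold(10.0, lr=0.1, weight_decay=0.1, threshold=1.0)
slow = steps_to_norm_threshold(10.0, lr=0.1, weight_decay=0.05, threshold=1.0)
```

In this simplified picture, the "grokking delay" is just the time for the norm to decay past a separation threshold, which is why weaker regularization or a smaller learning rate stretches the delay out.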
Summary compiled from 2 sources.
IMPACT Provides a theoretical framework for understanding and predicting model generalization delays, potentially enabling more efficient training.
RANK_REASON The cluster contains two arXiv papers presenting theoretical analyses of machine learning optimization algorithms and phenomena.