Weight norm's role in neural network grokking clarified

By PulseAugur Editorial · [1 source] · 2026-06-18 04:00

Researchers have investigated 'grokking' in neural networks, the delayed transition from memorization to generalization. Their findings indicate that the weight norm, previously thought to be the main driver of this transition, acts mostly as an upstream handle for the logit scale. By manipulating the logit scale directly, the researchers could control the grokking delay across its entire range, with the weight norm contributing only a minor additional effect. The relationship depends on the loss function: mean-squared error shows a different mechanism than cross-entropy.
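To make the "upstream handle" idea concrete, here is a minimal sketch, not the paper's code. It assumes a hypothetical two-layer, bias-free ReLU network with made-up toy weights (`W1`, `W2`), an arbitrary input `x`, and a chosen `label`. For such a network, multiplying every weight by a factor `a` multiplies the logits by `a ** 2`, so rescaling the weights and rescaling the logits directly give the same cross-entropy loss. The summary does not state the architecture used in the paper, so treat this only as an illustration of the general mechanism.

```python
import math

def matvec(W, v):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def scale(W, a):
    # Multiply every weight by a (this changes the weight norm by a factor of a).
    return [[a * w for w in row] for row in W]

def logits(W1, W2, x):
    # Hypothetical toy model: two bias-free layers with a ReLU in between.
    return matvec(W2, relu(matvec(W1, x)))

def cross_entropy(z, label):
    # Numerically stable log-sum-exp minus the logit of the true class.
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[label]

# Toy weights and input, chosen only for illustration.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
W2 = [[0.6, -0.1, 0.2], [-0.4, 0.3, 0.5]]
x = [1.0, 2.0]
label = 1  # the class this toy model already predicts

base = logits(W1, W2, x)
results = {}
for a in (0.5, 1.0, 2.0):
    z_weights = logits(scale(W1, a), scale(W2, a), x)  # rescale the weights
    z_direct = [a ** 2 * v for v in base]              # rescale the logits directly
    results[a] = (z_weights, z_direct, cross_entropy(z_weights, label))
    print(f"a={a}: logits={z_weights}, loss={results[a][2]:.4f}")
```

In this toy setting the two routes are interchangeable, which is one way to read the claim that the norm matters mainly through the logit scale. Because the example is already classified correctly, a larger scale also lowers its cross-entropy loss.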

IMPACT Clarifies the underlying mechanisms of generalization in neural networks, potentially informing future model architectures and training strategies.

RANK_REASON The item is an academic paper detailing research findings on a machine learning phenomenon.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Truong Xuan Khanh · 2026-06-18 04:00

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

arXiv:2606.18465v1. Abstract: Grokking, the delayed jump from memorization to generalization, is usually tied to the weight norm: a smaller norm generalizes sooner. We ask what the norm actually controls. Holding the weight norm fixed by clamping and varying o…
