The Weight Norm Sets the Grokking Timescale: A Causal Delay Law
This study investigates "grokking" in neural networks, where generalization emerges long after the model has already fit the training data. Its results suggest that the weight norm governs this delay. By intervening on the weight norm during training, the authors find that a critical norm value, Wc, is consistently reached, and that Wc scales with the modular base as a power law. Holding the norm at a fixed multiple of Wc then produces a grokking delay that depends exponentially on that multiple.
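The norm intervention described above can be pictured as rescaling all weights to a chosen global L2 norm after each optimizer step. The following is a minimal sketch under assumptions, not the authors' implementation: the helper names `global_l2_norm` and `rescale_to_norm`, the flat-list representation of weights, and the values of `W_C` and `NORM_MULTIPLE` are all hypothetical placeholders chosen for illustration.

```python
import math


def global_l2_norm(params):
    """L2 norm over all weights, treating every tensor as one flat vector."""
    return math.sqrt(sum(w * w for tensor in params for w in tensor))


def rescale_to_norm(params, target_norm):
    """Return a copy of params scaled so the global L2 norm equals target_norm.

    Every weight is multiplied by the same factor, so the direction of the
    weight vector is unchanged and only its length is altered.
    """
    current = global_l2_norm(params)
    if current == 0.0:
        raise ValueError("cannot rescale an all-zero weight vector")
    scale = target_norm / current
    return [[w * scale for w in tensor] for tensor in params]


# Hypothetical placeholder values, not taken from the study.
W_C = 2.0            # assumed critical norm
NORM_MULTIPLE = 1.5  # assumed multiple of W_C at which the norm is held

# Two toy "tensors", each stored as a flat list of floats.
params = [[3.0, 0.0], [0.0, 4.0]]

# In a training loop this call would follow each optimizer step.
clamped = rescale_to_norm(params, NORM_MULTIPLE * W_C)
```

Uniform rescaling is one simple way to pin the norm; projecting only when the norm drifts past a tolerance, or constraining each layer separately, would be alternative designs that the summary does not distinguish between.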