Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
Researchers have identified weight decay as a key parameter controlling which training regime a transformer enters on modular arithmetic tasks. They introduce two low-cost online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that monitor training dynamics directly from attention activations. Applied across a range of experimental conditions and model scales, the diagnostics distinguish memorization, generalization (grokking), and collapse, and they locate specific transition points for the memorization-to-developmental boundary.
AI IMPACT: Provides new methods for understanding and controlling transformer behavior during training, potentially leading to more efficient and effective model development.
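To make the two diagnostics concrete, here is a minimal NumPy sketch of how they could be computed from one layer's attention probabilities. The summary does not give the exact definitions, so the details are assumptions: heads are compared by flattening each head's attention map into a vector, and "entropy standard deviation" is taken to be the standard deviation across heads of the mean per-query attention entropy. The function names `head_cosine_similarity` and `entropy_std` are hypothetical.

```python
import numpy as np


def head_cosine_similarity(attn: np.ndarray) -> float:
    """Mean pairwise cosine similarity between attention heads.

    attn: array of shape (heads, queries, keys) holding attention
    probabilities (each row sums to 1). Assumed layout, not from the source.
    """
    n_heads = attn.shape[0]
    if n_heads < 2:
        raise ValueError("need at least two heads to form a pair")
    flat = attn.reshape(n_heads, -1)                    # one vector per head
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)           # guard against zero norm
    sim = unit @ unit.T                                 # (heads, heads) cosine matrix
    upper = np.triu_indices(n_heads, k=1)               # each unordered pair once
    return float(sim[upper].mean())


def entropy_std(attn: np.ndarray) -> float:
    """Standard deviation across heads of mean attention entropy (in nats)."""
    p = np.clip(attn, 1e-12, 1.0)                       # avoid log(0)
    row_entropy = -(p * np.log(p)).sum(axis=-1)         # (heads, queries)
    per_head = row_entropy.mean(axis=-1)                # (heads,)
    return float(per_head.std())


if __name__ == "__main__":
    n = 4
    sharp = np.eye(n)                    # each query attends to exactly one key
    diffuse = np.full((n, n), 1.0 / n)   # each query attends uniformly
    attn = np.stack([sharp, diffuse])    # two heads with very different patterns
    print("cosine similarity:", head_cosine_similarity(attn))
    print("entropy std:", entropy_std(attn))
```

Both quantities need only the attention probabilities already produced in the forward pass, which is what makes them cheap enough to log at every evaluation step. Under these assumed definitions, a cosine similarity near 1 together with an entropy spread near 0 means the heads are behaving almost identically; how such values map onto the memorization, grokking, and collapse regimes is a matter for the original work, not this sketch.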