PulseAugur

Weight decay controls transformer training regimes; new low-cost diagnostics track them

A new arXiv paper identifies weight decay as a scalar control parameter for the training regimes of transformers on modular arithmetic tasks. It introduces two low-cost online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, which monitor training dynamics directly from attention activations. Applied across several experimental conditions and model scales, the diagnostics distinguish memorization, generalization (grokking), and collapse, and the paper identifies specific transition points for the memorization-to-developmental boundary.
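As an illustration of what such diagnostics might look like, here is a minimal NumPy sketch. It is not the paper's implementation: the function names, the tensor layout (heads, queries, keys), and the choice to take the entropy standard deviation across per-head mean entropies are all assumptions, since the summary does not give the exact definitions.

```python
import numpy as np


def head_cosine_similarity(attn):
    """Mean pairwise cosine similarity between attention heads.

    attn: array of shape (heads, queries, keys); each row along the last
    axis is assumed to be a softmax distribution. Needs at least 2 heads.
    """
    n_heads = attn.shape[0]
    flat = attn.reshape(n_heads, -1)                  # one vector per head
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)         # guard against zero norm
    sim = unit @ unit.T                               # (heads, heads) cosine matrix
    upper = np.triu_indices(n_heads, k=1)             # distinct head pairs only
    return float(sim[upper].mean())


def entropy_std(attn, eps=1e-12):
    """Standard deviation, across heads, of mean attention entropy.

    Assumption: entropy is averaged over query positions per head, then the
    spread is measured across heads. Other readings are possible.
    """
    entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, queries)
    per_head = entropy.mean(axis=1)                      # (heads,)
    return float(per_head.std())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 8, 8))                  # hypothetical activations
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    print(head_cosine_similarity(attn), entropy_std(attn))
```

Both functions read only the attention probabilities from a forward pass, so under these assumptions they could be logged every few steps at little extra cost. A similarity near 1 would mean the heads are behaving alike, and a small entropy spread would mean they are equally focused or equally diffuse.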

IMPACT Provides low-cost diagnostics for monitoring transformer training regimes, which may help in choosing weight decay during model development.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English (EN) · Lucky Verma

    Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

    arXiv:2605.20441v1 (cross-listed). Abstract: Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two c…

  2. arXiv cs.NE (Neural & Evolutionary) TIER_1 English (EN) · Lucky Verma

    Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

    Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-h…