Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

Weight decay controls Transformer training regimes and reveals new diagnostics

By PulseAugur Editorial · [2 sources] · 2026-05-19 19:48

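Researchers found that weight decay is the key parameter controlling the training regime of Transformers on modular arithmetic tasks. They introduce two new, cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, to monitor how attention activations evolve during training. Applied across a range of experimental conditions and model scales, these diagnostics effectively distinguish memorization, generalization (grokking), and collapse, and pinpoint specific transition points at the memorization-to-development boundary.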
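The two diagnostics named above can be sketched in a few lines. This is only an illustrative sketch, not the paper's implementation: the summary does not give the exact definitions, so the attention layout `attn[h][q][k]`, the flattening of each head's attention map before comparison, the per-head averaging of row entropies, and the function names `mean_pairwise_head_cosine` and `head_entropy_std` are all assumptions made here.

```python
import math
from itertools import combinations
from statistics import pstdev


def _cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, 1e-12)


def mean_pairwise_head_cosine(attn):
    """Mean cosine similarity over all unordered pairs of heads.

    attn[h][q][k] is the attention probability that head h assigns from
    query q to key k (assumed layout). Each head's map is flattened to a
    single vector before comparison (assumption).
    """
    if len(attn) < 2:
        raise ValueError("need at least two heads to form a pair")
    flat = [[p for row in head for p in row] for head in attn]
    sims = [_cosine(u, v) for u, v in combinations(flat, 2)]
    return sum(sims) / len(sims)


def _row_entropy(row):
    """Shannon entropy (in nats) of one attention row; 0 * log 0 is taken as 0."""
    return -sum(p * math.log(p) for p in row if p > 0.0)


def head_entropy_std(attn):
    """Standard deviation across heads of each head's mean row entropy.

    Averaging entropy over queries within a head, then taking the
    population standard deviation over heads, is an assumed reading of
    "entropy standard deviation".
    """
    per_head = [sum(_row_entropy(row) for row in head) / len(head) for head in attn]
    return pstdev(per_head)


if __name__ == "__main__":
    # Toy input: one head with uniform attention, one with identity attention.
    uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
    identity = [[1.0 if q == k else 0.0 for k in range(3)] for q in range(3)]
    print(mean_pairwise_head_cosine([uniform, identity]))
    print(head_entropy_std([uniform, identity]))
```

Under these assumptions, both quantities need only the attention probabilities already produced in the forward pass, which is what makes logging them at every evaluation step cheap. Heads that all attend identically give a cosine similarity near 1 and an entropy standard deviation near 0; how those values map onto memorization, grokking, or collapse is something the paper itself would have to specify.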

Impact: Offers a new way to understand and control Transformer training behavior, which could make model development more efficient and effective.

Ranking rationale: This cluster contains an academic paper detailing new research findings on Transformer training dynamics.

Read on arXiv cs.NE (Neural & Evolutionary) →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Lucky Verma · 2026-05-22 04:00

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

arXiv:2605.20441v1 Announce Type: cross Abstract: Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two c…

arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Lucky Verma · 2026-05-19 19:48

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-h…
