The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

Study links the grokking delay in Transformers to a decoder bottleneck

By the PulseAugur editorial team · [1 source] · 2026-06-18 04:00

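A new research paper examines "grokking" in Transformers: when trained on algorithmic tasks, a model fits its training set and then generalizes abruptly only after a long delay. The study argues that this delay stems from limited access to structure the model has already learned, not from a failure to acquire that structure. Analyzing one-step Collatz prediction, the researchers found that the encoder learns the relevant structure quickly, while a decoder bottleneck prolongs the period before generalization. Interventions such as transplanting a trained encoder, or freezing the encoder and retraining the decoder, markedly accelerated learning and improved accuracy. The numeral representation also plays a crucial role.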
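For readers unfamiliar with the task, the following is a minimal sketch of what "one-step Collatz prediction" could look like as a sequence-to-sequence dataset. The summary does not describe the paper's tokenization, numeral base, or data split, so the digit encoding and the helper names `collatz_step`, `to_digits`, and `make_pairs` are assumptions made purely for illustration.

```python
def collatz_step(n: int) -> int:
    """Apply one Collatz step: n / 2 if n is even, 3n + 1 if n is odd."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n // 2 if n % 2 == 0 else 3 * n + 1


def to_digits(n: int, base: int) -> list[int]:
    """Return the digits of n in the given base, most significant first.

    The base is a free parameter here (an assumption): the summary only
    says that the numeral representation matters, not which one is used.
    """
    if base < 2:
        raise ValueError("base must be at least 2")
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


def make_pairs(numbers, base):
    """Build (input digits, target digits) pairs for one-step prediction."""
    return [
        (to_digits(n, base), to_digits(collatz_step(n), base))
        for n in numbers
    ]


if __name__ == "__main__":
    # The same example, 7 -> 22, rendered in two different bases.
    print(make_pairs([7], base=10))
    print(make_pairs([7], base=2))
```

The same input-output pair looks quite different depending on the base, which is one plausible way to picture why the choice of numeral representation could affect how easily a model picks up the mapping.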

Impact: Offers insight into Transformer learning dynamics that may inform more efficient model architectures and training strategies.

Ranking rationale: A research paper detailing findings on Transformer model behavior.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.AI · Laura Gomezjurado Gonzalez · 2026-06-18 04:00

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

arXiv:2604.13082v2. Abstract: Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmet…

