Uniform Scaling Limits in AdamW-Trained Transformers

Paper details uniform scaling limits in AdamW-trained transformers

By the PulseAugur editorial team · [2 sources] · 2026-05-11 16:54

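Researchers have published a paper on uniform scaling limits in transformers trained with the AdamW optimizer. The study models the hidden-state dynamics as an interacting particle system and proves that these dynamics converge to a forward-backward system of ODEs. The convergence rate depends on the transformer's depth and number of heads, and the paper derives explicit bounds that are independent of the number of tokens and the embedding dimension.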
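To make the particle-system picture concrete, the sketch below treats each token as a particle whose position is updated by an attention-driven velocity, with one residual layer acting as one explicit Euler step of size 1/L. It is an illustration only, not the paper's construction: the 1/L residual step, the 1/H averaging over heads, the depth-constant random weights, and every function name here are assumptions made for this example. Only the forward pass is shown; the backward equations and the AdamW update are left out.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_drift(x, heads):
    """Velocity of n particles x (shape (n, d)), averaged over attention heads."""
    n, d = x.shape
    v = np.zeros_like(x)
    for Q, K, V in heads:
        scores = (x @ Q) @ (x @ K).T / np.sqrt(d)  # (n, n) pairwise interactions
        v += softmax(scores, axis=1) @ (x @ V)     # each particle is pulled by all others
    return v / len(heads)                          # 1/H averaging (assumed scaling)

def forward(x0, layers):
    """Residual forward pass: L explicit Euler steps of size 1/L on the interval [0, 1]."""
    L = len(layers)
    x = x0.copy()
    for heads in layers:
        x = x + attention_drift(x, heads) / L      # 1/L residual scaling (assumed)
    return x

def make_layers(L, H, d, seed=0):
    """Hypothetical helper: H random heads reused at every depth, so that
    networks of different depth L discretise the same ODE."""
    rng = np.random.default_rng(seed)
    scale = 0.5 / np.sqrt(d)
    heads = [tuple(scale * rng.standard_normal((d, d)) for _ in range(3))
             for _ in range(H)]
    return [heads] * L

if __name__ == "__main__":
    x0 = np.random.default_rng(1).standard_normal((5, 4))  # 5 tokens, dimension 4
    reference = forward(x0, make_layers(2048, 2, 4))       # proxy for the depth limit
    for L in (8, 64, 512):
        gap = np.linalg.norm(forward(x0, make_layers(L, 2, 4)) - reference)
        print(f"L={L:4d}  distance to deep reference: {gap:.3e}")
```

Running the script prints the distance between a depth-L network and a much deeper reference network sharing the same weights; in this toy setting the distance shrinks as L grows, which is the kind of depth-dependent convergence the summary describes.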

Impact: Offers theoretical insight into transformer scaling that may inform future model design and training strategies.

Ranking rationale: An academic paper published on arXiv presenting theoretical findings on transformer model scaling.

Read on arXiv stat.ML →

AI-generated summary · Google Gemini · based on 2 sources.

Sources [2]

arXiv stat.ML TIER_1 English(EN) · William Gibson, Christoph Reisinger · 2026-05-13 04:00

Uniform Scaling Limits in AdamW-Trained Transformers

arXiv:2605.11059v1 Announce Type: new Abstract: We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention he…
arXiv stat.ML TIER_1 English(EN) · Christoph Reisinger · 2026-05-11 16:54

Uniform Scaling Limits in AdamW-Trained Transformers

We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hid…
