Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

New paper details how cross-entropy training shapes the Transformer attention mechanism

By the PulseAugur editorial team · [1 source] · 2026-05-19 04:00

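Researchers analyzed how cross-entropy training shapes the attention scores and value vectors inside Transformer attention heads. Their work introduces an advantage-based routing law for attention scores and a responsibility-weighted update for value vectors. Together, the two rules create a feedback loop in which queries and values co-specialize, which is what lets Transformers perform precise probabilistic reasoning.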
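As a rough illustration of the two update rules named above, the sketch below works through a toy single-query attention head in plain Python. The toy model (a linear readout `W` applied to the attended context), all variable names, and the reading of "advantage" as `-g·v_j` minus its attention-weighted mean are assumptions made for illustration; they follow from the standard softmax gradient, and the paper's exact formulation may differ.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(scores, values):
    """Return attention weights and the attention-weighted context vector."""
    alpha = softmax(scores)
    dim = len(values[0])
    context = [sum(a * v[k] for a, v in zip(alpha, values)) for k in range(dim)]
    return alpha, context

def loss(scores, values, W, target):
    """Cross-entropy of a toy head: logits = W @ context (hypothetical readout)."""
    _, context = attend(scores, values)
    probs = softmax([dot(row, context) for row in W])
    return -math.log(probs[target])

def analytic_grads(scores, values, W, target):
    alpha, context = attend(scores, values)
    probs = softmax([dot(row, context) for row in W])
    # Cross-entropy error signal at the logits: p - onehot(target).
    err = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # g = dL/dcontext, the upstream gradient reaching the head's output.
    g = [sum(W[i][k] * err[i] for i in range(len(W))) for k in range(len(context))]
    # Assumed "advantage": how much value j reduces the loss, to first order.
    advantage = [-dot(g, v) for v in values]
    baseline = dot(alpha, advantage)  # attention-weighted mean advantage
    # Routing rule: dL/ds_j = -alpha_j * (advantage_j - baseline).
    grad_scores = [-a * (adv - baseline) for a, adv in zip(alpha, advantage)]
    # Responsibility weighting: dL/dv_j = alpha_j * g.
    grad_values = [[a * gk for gk in g] for a in alpha]
    return grad_scores, grad_values

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences, used only to check the analytic formulas."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

# Made-up toy numbers: three positions, 2-d values, three output classes.
scores = [0.2, -0.5, 1.0]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
W = [[1.0, -1.0], [0.3, 0.8], [-0.6, 0.1]]
target = 0

grad_scores, grad_values = analytic_grads(scores, values, W, target)
numeric_scores = numeric_grad(lambda s: loss(s, values, W, target), scores)
max_gap = max(abs(a - n) for a, n in zip(grad_scores, numeric_scores))
print("max gap between analytic and numeric score gradients:", max_gap)
```

Under gradient descent, a score rises exactly when its position's advantage exceeds the attention-weighted baseline, and each value vector moves along `-g` in proportion to the attention it already receives. That coupling is one way to read the feedback loop the summary describes.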

Impact: Explains the internal geometry that enables Transformers to reason probabilistically, offering insight for model interpretability.

Ranking rationale: The cluster contains an academic paper detailing new research findings.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv stat.ML · Naman Agarwal, Siddhartha R. Dalal, Vishal Misra · 2026-05-19 04:00

Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

arXiv:2512.22473v5. Abstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed "Bayesian wind tunnels" and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required int…