Beyond Linearity in Attention Projections: The Case for Nonlinear Queries


By PulseAugur Editorial · [11 sources] · 2026-04-27 04:00

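Researchers are probing the principles underlying the Transformer attention mechanism, with new papers analyzing its gradient-flow structure and dynamics. One study interprets attention as a gradient flow on the unit sphere and identifies factors that affect token clustering and stability in multi-head settings. Another examines the critical training window for complexity control, determining when Transformers favor reasoning over memorization. Further work traces the origin of geometric continuity in deep neural networks to residual connections and symmetry-breaking nonlinearities, and examines the structural causes of the "attention sink" phenomenon.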

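Impact: These theoretical analyses offer deeper insight into Transformer behavior and may guide future architectural improvements and training strategies for building more efficient, more capable models.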

Ranking rationale: Multiple arXiv papers were published on theoretical aspects of Transformer attention mechanisms and training dynamics.


AI-generated summary · Google Gemini · from 11 sources.

Sources [11]

arXiv cs.LG TIER_1 English(EN) · Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu · 2026-05-08 04:00

The Structural Origins of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

arXiv:2605.06611v1 Announce Type: new Abstract: Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a \textit{mechan…
arXiv cs.LG TIER_1 English(EN) · Ayan Pendharkar · 2026-05-07 04:00

Gradient-Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention

arXiv:2605.04279v1 Announce Type: new Abstract: Transformer self-attention can be interpreted as a gradient flow on the unit sphere, in which tokens evolve under softmax interaction potentials and tend to form clusters. While prior work has established clustering behavior for sin…
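
The clustering behavior described in this abstract can be illustrated with a toy simulation. The sketch below is an assumption-laden simplification, not the paper's model: it uses a single head, identity query/key/value matrices, an explicit Euler step, and hypothetical values for `beta` and `dt`. Each token moves along the tangent-space projection of its softmax-weighted average of all tokens and is then renormalized onto the unit circle.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [c / n for c in v]


def step(tokens, beta=1.0, dt=0.1):
    """One Euler step of simplified attention dynamics on the unit sphere.

    beta (inverse temperature) and dt (step size) are illustrative choices.
    """
    new_tokens = []
    for x in tokens:
        # Softmax weights over all tokens, using plain inner products as logits.
        weights = [math.exp(beta * dot(x, y)) for y in tokens]
        total = sum(weights)
        mean = [
            sum(w * y[k] for w, y in zip(weights, tokens)) / total
            for k in range(len(x))
        ]
        # Project the weighted mean onto the tangent space at x.
        radial = dot(mean, x)
        tangent = [m - radial * c for m, c in zip(mean, x)]
        # Move along the tangent direction, then return to the sphere.
        new_tokens.append(normalize([c + dt * t for c, t in zip(x, tangent)]))
    return new_tokens


def min_pairwise_cosine(tokens):
    """Smallest cosine similarity between any two distinct tokens."""
    return min(dot(a, b) for i, a in enumerate(tokens) for b in tokens[i + 1:])


# Three tokens on the unit circle, all within the same half-plane.
angles = [0.1, 0.8, 1.5]
tokens = [[math.cos(a), math.sin(a)] for a in angles]
initial_min_cos = min_pairwise_cosine(tokens)

for _ in range(200):
    tokens = step(tokens)

final_min_cos = min_pairwise_cosine(tokens)
print(initial_min_cos, final_min_cos)
```

In this setup the minimum pairwise cosine rises as the tokens drift toward one another, which is the qualitative clustering effect the abstract refers to. The multi-head setting analyzed in the paper is not modeled here.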
arXiv cs.LG TIER_1 English(EN) · Sarwan Ali · 2026-05-07 04:00

Critical Windows for Complexity Control: When Transformers Choose Reasoning or Memorization

arXiv:2605.04396v1 Announce Type: new Abstract: Recent work has shown that Transformers' compositional generalization is governed by complexity control, initialization scale and weight decay, which steers training toward low-complexity reasoning solutions rather than high-…
arXiv cs.LG TIER_1 English(EN) · Kyungwon Jeong, Won-Gi Paeng, Honggyo Suh · 2026-05-07 04:00

Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking

arXiv:2605.04971v1 Announce Type: new Abstract: Weight matrices in deep networks exhibit geometric continuity -- principal singular vectors of adjacent layers point in similar directions. While this property has been widely observed, its origin remains unexplained. Through experi…
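
One way to make "principal singular vectors of adjacent layers point in similar directions" concrete is to compare the top singular vectors of two weight matrices by absolute cosine similarity. The sketch below is only an illustration under assumptions: the helper names are hypothetical, it uses right singular vectors obtained by power iteration, and the abstract does not say which singular vectors or which similarity measure the authors use.

```python
import math


def top_right_singular_vector(w, iters=200):
    """Approximate the top right singular vector of w by power iteration on W^T W."""
    rows, cols = len(w), len(w[0])
    v = [1.0 / math.sqrt(cols)] * cols  # fixed, deterministic starting vector
    for _ in range(iters):
        wv = [sum(a * b for a, b in zip(row, v)) for row in w]  # W v
        wtwv = [sum(w[i][j] * wv[i] for i in range(rows)) for j in range(cols)]  # W^T (W v)
        norm = math.sqrt(sum(c * c for c in wtwv))
        if norm == 0.0:
            break
        v = [c / norm for c in wtwv]
    return v


def continuity_score(w_a, w_b):
    """Absolute cosine between the top right singular vectors of two layers."""
    va = top_right_singular_vector(w_a)
    vb = top_right_singular_vector(w_b)
    return abs(sum(a * b for a, b in zip(va, vb)))


# Toy rank-1 "layers": layer_1 and layer_2 share an input direction, layer_3 does not.
layer_1 = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
layer_2 = [[1.0, 0.0, 0.0], [-3.0, 0.0, 0.0]]
layer_3 = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]

aligned = continuity_score(layer_1, layer_2)
unaligned = continuity_score(layer_1, layer_3)
print(aligned, unaligned)
```

A score near 1 indicates aligned principal directions and a score near 0 indicates orthogonal ones. On real networks the same comparison would be applied to consecutive layers' weight matrices.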
arXiv cs.CL TIER_1 English(EN) · Honggyo Suh · 2026-05-06 14:27

Why Geometric Continuity Emerges in Deep Neural Networks: Residual Connections and Rotational Symmetry Breaking

Weight matrices in deep networks exhibit geometric continuity -- principal singular vectors of adjacent layers point in similar directions. While this property has been widely observed, its origin remains unexplained. Through experiments on toy MLPs and small transformers, we ide…
arXiv cs.LG TIER_1 English(EN) · Zheng-An Chen, Pengxiao Lin, Zhi-Qin John Xu, Tao Luo · 2026-05-05 04:00

Focus and Dilution: The Multi-Stage Learning Process of Attention

arXiv:2605.01199v1 Announce Type: new Abstract: Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention lear…
arXiv cs.LG TIER_1 English(EN) · Marko Karbevski · 2026-04-27 04:00

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

arXiv:2603.13381v2 Announce Type: replace Abstract: Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X…
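
The abstract is cut off before it finishes its explanation, so the following is a sketch of a standard algebraic identity rather than the paper's own argument. It assumes a single head with square projection matrices, where the pre-softmax logits are $X W_Q W_K^\top X^\top$ and so depend on the two projections only through their product. Folding $W_Q$ into the key projection and replacing the query projection with the identity then leaves the logits unchanged. All names below are illustrative, and the usual $1/\sqrt{d}$ scaling is omitted because it affects both versions equally.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def attention_logits(x, w_q, w_k):
    """Pre-softmax attention logits (X W_Q)(X W_K)^T for a single head."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    return matmul(q, transpose(k))


rng = random.Random(0)
n_tokens, d = 4, 3
x = rand_matrix(n_tokens, d, rng)
w_q = rand_matrix(d, d, rng)
w_k = rand_matrix(d, d, rng)

# Replace W_Q with the identity and absorb it into the key projection:
# W_K' = W_K W_Q^T, so that X (X W_K')^T = X W_Q W_K^T X^T.
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
w_k_folded = matmul(w_k, transpose(w_q))

original = attention_logits(x, w_q, w_k)
folded = attention_logits(x, identity, w_k_folded)

max_gap = max(
    abs(a - b) for row_a, row_b in zip(original, folded) for a, b in zip(row_a, row_b)
)
print(max_gap)
```

The two logit matrices agree up to floating-point error, which is why a linear query projection can look redundant in this simplified setting.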
arXiv stat.ML TIER_1 English(EN) · Tianyang Hu · 2026-05-07 17:28

The Structural Origins of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for this phenomenon. First, w…
arXiv stat.ML TIER_1 English(EN) · Jerry Yao-Chieh Hu, Mingcheng Lu, Yi-Chen Lee, Han Liu · 2026-04-29 04:00

Transformer Approximations from ReLUs

arXiv:2604.24878v1 Announce Type: cross Abstract: We provide a systematic recipe for translating ReLU approximation results to softmax attention mechanism. This recipe covers many common approximation targets. Importantly, it yields target-specific, economic resource bounds beyon…
arXiv stat.ML TIER_1 English(EN) · Han Liu · 2026-04-27 18:04

Transformer Approximations from ReLUs

We provide a systematic recipe for translating ReLU approximation results to softmax attention mechanism. This recipe covers many common approximation targets. Importantly, it yields target-specific, economic resource bounds beyond universal approximation statements. We showcase …
Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-04-29 21:20

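What Is Attention Actually Paying Attention To? A 10-Minute Manim Walkthrough of Query, Key, Value, Softmax, Multi-Head Attention, and Why Long Context Gets Expensive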

What is attention actually paying attention to? A 10-minute Manim walkthrough of Query, Key, Value, softmax, multi-head attention, and why long context gets expensive. Watch: https://youtu.be/nFyr1tx2C-E Mirror: https://attention-mechanism-20260430.vercel.app/attention_mechani…

Link: youtube.com/watch
