Characterizing the Expressivity of Local Attention in Transformers

Researchers Explore Efficient Transformers Through Attention Control and Algorithmic Capture

By the PulseAugur editorial team · [9 sources] · 2026-05-01 16:30

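Researchers are exploring ways to improve the efficiency and understanding of Transformers. One paper introduces Budgeted Attention Allocation, a head-gating mechanism that allows cost-quality trade-offs. Another study defines algorithmic capture in Transformers and analyzes the associated computational complexity, showing an inductive bias against higher-complexity procedures. In addition, work on local attention in Transformers demonstrates its expressivity and its complementarity with global attention, which could improve model quality. Finally, one study investigates how attention sinks induce gradient sinks during backpropagation, with massive activations acting as regulators.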
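
To make the difference between global and local attention concrete, here is a minimal Python sketch of the two attention masks. It assumes a causal (decoder-style) setting and a fixed window size. The function names and the `window` parameter are illustrative assumptions and are not taken from the paper, whose exact definition of local attention is not given in the sources listed here.

```python
def causal_mask(n):
    """Global causal attention: position i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def local_causal_mask(n, window):
    """Local causal attention (assumed form): position i attends only to the
    most recent `window` positions, itself included."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as text: 'x' = attention allowed, '.' = masked out."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


# Compare the two patterns on a sequence of 5 tokens.
print(show(causal_mask(5)))
print()
print(show(local_causal_mask(5, window=2)))
```

In this sketch, a window at least as long as the sequence reproduces the global causal mask, which is one way to see local attention as a restriction of global attention.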

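Impact: These studies offer theoretical and empirical insights into Transformer efficiency, computational complexity, and training dynamics, and may guide future model development.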

Ranking rationale: Multiple arXiv papers present new research on Transformer architecture, efficiency, and computational properties.


AI-generated summary · Google Gemini · from 9 sources.

Sources [9]

arXiv cs.LG TIER_1 English(EN) · Amrit Nidhi · 2026-05-08 04:00

Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers

arXiv:2605.05697v1 Announce Type: new Abstract: Transformers usually expose one inference cost per trained model, while deployed systems often need multiple cost-quality operating points. We study Budgeted Attention Allocation, a monotone head-gating mechanism conditioned on a re…
arXiv cs.LG TIER_1 English(EN) · Orit Davidovich, Zohar Ringel · 2026-05-08 04:00

Algorithmic Task Capture, Computational Complexity, and the Inductive Bias of Infinite Transformers

arXiv:2603.11161v2 Announce Type: replace Abstract: We formally define algorithmic capture of combinatorial tasks as the ability of a transformer to extrapolate to arbitrary task sizes with controllable error and logarithmic sample adaptation, providing a sharp scaling criterion …
arXiv cs.LG TIER_1 English(EN) · Lena Ehrmuth, Laura Strieker · 2026-05-07 04:00

Average Attention Transformers and Arithmetic Circuits

arXiv:2605.04683v1 Announce Type: cross Abstract: We analyse the computational power of transformer encoders as sequence-to-sequence functions on vectors. We show that average hard attention can be used to simulate arithmetic circuits if they are given as an input to an encoder. …
arXiv cs.LG TIER_1 English(EN) · Yihong Chen, Zhouchen Lin, Quanming Yao · 2026-05-07 04:00

Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers

arXiv:2603.17771v2 Announce Type: replace Abstract: Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing explanations have largely focused on the forward pass, yet in pre-norm Transformers, large residual-stream norms…
arXiv cs.AI TIER_1 English(EN) · Laura Strieker · 2026-05-06 09:35

Average Attention Transformers and Arithmetic Circuits

We analyse the computational power of transformer encoders as sequence-to-sequence functions on vectors. We show that average hard attention can be used to simulate arithmetic circuits if they are given as an input to an encoder. The circuit families that can be simulated this wa…
arXiv cs.LG TIER_1 English(EN) · Stephen J. Thomas · 2026-05-06 04:00

Cascade Token Selection for Transformer Attention Acceleration

arXiv:2605.03110v1 Announce Type: new Abstract: A method is presented for reducing the cost of representative token selection in transformer attention layers by exploiting the coherence of the representative set across depth. Activation Decorrelation Attention (ADA) selects $r \l…
arXiv cs.CL TIER_1 English(EN) · Jiaoda Li, Ryan Cotterell · 2026-05-04 04:00

Characterizing the Expressivity of Local Attention in Transformers

arXiv:2605.00768v1 Announce Type: new Abstract: The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generat…
arXiv cs.CL TIER_1 English(EN) · Ryan Cotterell · 2026-05-01 16:30

Characterizing the Expressivity of Local Attention in Transformers

The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attent…
dev.to — LLM tag TIER_1 English(EN) · Rijul Rajesh · 2026-05-05 19:25

Understanding Decoder-Only Transformers (Part 1): Masked Self-Attention

Decoder-Only Transformers. In this article, we will explore decoder-only transformers. Decoder-only transformers are a specific type of transformer architecture used in systems like ChatGPT. Masked Self-Attention. Decoder-only…
