PulseAugur

Researchers Explore Efficient Transformers Through Attention Control and Algorithmic Capture

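Researchers are exploring ways to make Transformers more efficient and better understood. One paper introduces Budgeted Attention Allocation, a head-gating mechanism that allows cost-quality trade-offs. Another study defines algorithmic capture in Transformers and analyzes their computational complexity, indicating an inductive bias against higher-complexity procedures. In addition, work on local attention in Transformers characterizes its expressivity and its complementarity with global attention, which may improve model quality. Finally, one study investigates how attention sinks induce gradient sinks during backpropagation, with massive activations acting as regulators.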
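To make the local-versus-global distinction in the summary concrete, here is a minimal sketch of the two attention masks. It assumes the common sliding-window definition of local attention (each position sees only its most recent `window` positions, itself included); the function name `causal_mask` and the `window` parameter are illustrative and are not taken from any of the papers listed below.

```python
def causal_mask(n, window=None):
    """Return an n x n boolean matrix; entry [i][j] is True when query
    position i may attend to key position j.

    window=None -> global causal attention: every position j <= i is visible.
    window=w    -> local causal attention (assumed sliding-window form):
                   only positions with i - w < j <= i are visible.
    """
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            visible = j <= i  # causal constraint: never look ahead
            if window is not None:
                visible = visible and (i - j < window)  # locality constraint
            row.append(visible)
        mask.append(row)
    return mask


# Hypothetical sizes chosen only for illustration.
global_mask = causal_mask(4)
local_mask = causal_mask(4, window=2)
```

With four positions, the last row of `global_mask` allows all four keys, while the last row of `local_mask` allows only the final two. This is the structural difference behind the complementarity the summary mentions.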

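Impact: These studies offer theoretical and empirical insights into Transformer efficiency, computational complexity, and training dynamics, and may guide future model development.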

Ranking rationale: Multiple arXiv papers present new research on Transformer architecture, efficiency, and computational properties.


AI-generated summary · Google Gemini · from 9 sources.


Sources [9]

  1. arXiv cs.LG TIER_1 English(EN) · Amrit Nidhi ·

    Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers

    arXiv:2605.05697v1 Announce Type: new Abstract: Transformers usually expose one inference cost per trained model, while deployed systems often need multiple cost-quality operating points. We study Budgeted Attention Allocation, a monotone head-gating mechanism conditioned on a re…

  2. arXiv cs.LG TIER_1 English(EN) · Orit Davidovich, Zohar Ringel ·

    Algorithmic Task Capture, Computational Complexity, and Inductive Bias of Infinite Transformers

    arXiv:2603.11161v2 Announce Type: replace Abstract: We formally define algorithmic capture of combinatorial tasks as the ability of a transformer to extrapolate to arbitrary task sizes with controllable error and logarithmic sample adaptation, providing a sharp scaling criterion …

  3. arXiv cs.LG TIER_1 English(EN) · Lena Ehrmuth, Laura Strieker ·

    Average Attention Transformers and Arithmetic Circuits

    arXiv:2605.04683v1 Announce Type: cross Abstract: We analyse the computational power of transformer encoders as sequence-to-sequence functions on vectors. We show that average hard attention can be used to simulate arithmetic circuits if they are given as an input to an encoder. …

  4. arXiv cs.LG TIER_1 English(EN) · Yihong Chen, Zhouchen Lin, Quanming Yao ·

    Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers

    arXiv:2603.17771v2 Announce Type: replace Abstract: Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing explanations have largely focused on the forward pass, yet in pre-norm Transformers, large residual-stream norms…

  5. arXiv cs.AI TIER_1 English(EN) · Laura Strieker ·

    Average Attention Transformers and Arithmetic Circuits

    We analyse the computational power of transformer encoders as sequence-to-sequence functions on vectors. We show that average hard attention can be used to simulate arithmetic circuits if they are given as an input to an encoder. The circuit families that can be simulated this wa…

  6. arXiv cs.LG TIER_1 English(EN) · Stephen J. Thomas ·

    Cascade Token Selection for Transformer Attention Acceleration

    arXiv:2605.03110v1 Announce Type: new Abstract: A method is presented for reducing the cost of representative token selection in transformer attention layers by exploiting the coherence of the representative set across depth. Activation Decorrelation Attention (ADA) selects $r \l…

  7. arXiv cs.CL TIER_1 English(EN) · Jiaoda Li, Ryan Cotterell ·

    Characterizing the Expressivity of Local Attention in Transformers

    arXiv:2605.00768v1 Announce Type: new Abstract: The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generat…

  8. arXiv cs.CL TIER_1 English(EN) · Ryan Cotterell ·

    Characterizing the Expressivity of Local Attention in Transformers

    The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attent…

  9. dev.to — LLM tag TIER_1 English(EN) · Rijul Rajesh ·

    Understanding Decoder-Only Transformers Part 1: Masked Self-Attention

    Decoder-Only Transformers: In this article, we will explore decoder-only transformers. Decoder-only transformers are a specific type of transformer architecture used in systems like ChatGPT. Masked Self-Attention: Decoder-only…