Researchers explore novel attention mechanisms and optimization techniques for large language models

By PulseAugur Editorial · [37 sources] · 2026-05-04 01:57

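Researchers are exploring novel attention mechanisms to overcome the quadratic complexity of standard self-attention in transformers, particularly for long-context processing. Several papers introduce methods such as Lighthouse Attention (for efficient pre-training), Robust Filter Attention (which treats attention as state estimation), and the connectome-inspired Stochastic Attention (for greater expressiveness). Other work focuses on reducing attention's computational footprint through techniques such as early stopping for sparse attention (S2O), and on analyzing the theoretical limits of linearized attention. A framework called CuBridge is also proposed, which uses large language models to understand and reconstruct high-performance attention kernels.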

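Impact: These advances aim to improve the efficiency and capability of large language models, enabling them to handle longer contexts and complex computations more effectively.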

Ranking rationale: Multiple arXiv papers introduce novel attention mechanisms and optimization techniques for transformers.

AI-generated summary · Google Gemini · from 37 sources.

Sources [37]

arXiv cs.CL TIER_1 English(EN) · Marco Cuturi · 2026-05-10 21:51

Nectar: Neural Estimation of Cached-Token Attention via Regression

Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a comp…
arXiv cs.LG TIER_1 English(EN) · Mingfei Sun · 2026-05-08 16:22

Understanding Convergent Stochastic Training of Attention and LoRA

Transformers have revolutionized machine learning and deploying attention layers in the model is increasingly standard across a myriad of applications. Further, for large models, it is common to implement Low Rank Adaptation (LoRA), whereby a factorized parameterization of them i…
arXiv cs.AI TIER_1 English(EN) · Qian Wang · 2026-05-08 13:24

Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this…
arXiv cs.AI TIER_1 English(EN) · Matias Selin · 2026-05-08 13:14

Cross-Attention and Encoder-Decoder Transformers: A Logical Characterization

We give a novel logical characterization of encoder-decoder transformers, the foundational architecture for LLMs that also sees use in various settings that benefit from cross-attention. We study such transformers over text in the practical setting of floating-point numbers and s…
arXiv cs.LG TIER_1 English(EN) · Wenjie Pei · 2026-05-08 07:19

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses …
arXiv cs.LG TIER_1 English(EN) · Elad Hoffer, Yochai Blau, Ron Banner, Daniel Soudry, Boris Ginsburg · 2026-05-08 04:00

Retrieval from Within: An Intrinsic Capability of Attention-Based Models

arXiv:2605.05806v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) typically treats retrieval and generation as separate systems. We ask whether an attention-based encoder-decoder can instead retrieve directly from its own internal representations. We introduce …
arXiv cs.LG TIER_1 English(EN) · Yulong Huang, Xiang Liu, Hongxiang Huang, Xiaopeng Lin, Zunchang Liu, Xiaowen Chu, Zeke Xie, Bojun Cheng · 2026-05-08 04:00

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

arXiv:2605.05838v1 Announce Type: new Abstract: Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrence…
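
To illustrate the general idea of avoiding quadratic cost, here is a minimal sketch of generic causal kernelized linear attention. It is not MDN, Mamba2, or GDN; the feature map `phi` (elu(x) + 1) and the function name `causal_linear_attention` are assumptions chosen for this illustration. The point is that each step updates a fixed-size running state instead of revisiting all earlier tokens.

```python
import math

def phi(vec):
    """Positive feature map elu(x) + 1 (an assumed choice for this sketch)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def causal_linear_attention(queries, keys, values):
    """Causal linear attention via a running state; cost is linear in sequence length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for i in range(d_k):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = sum(a * b for a, b in zip(fq, norm))  # always positive since phi > 0
        outputs.append(
            [sum(fq[i] * state[i][j] for i in range(d_k)) / denom for j in range(d_v)]
        )
    return outputs
```

The state has size d_k x d_v regardless of how many tokens have been processed, which is what makes the recurrence attractive for long sequences.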
arXiv cs.LG TIER_1 English(EN) · Peter Racioppo · 2026-05-08 04:00

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

arXiv:2509.04154v5 Announce Type: replace Abstract: We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (…
arXiv cs.LG TIER_1 English(EN) · Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou · 2026-05-08 04:00

Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving

arXiv:2601.21351v2 Announce Type: replace Abstract: Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communi…
arXiv cs.LG TIER_1 English(EN) · Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez · 2026-05-08 04:00

Linear Attention Cannot Enter the Kernel Regime at Any Practical Width

arXiv:2603.13085v2 Announce Type: replace Abstract: Understanding whether attention mechanisms converge to the kernel regime is foundational to the validity of influence functions for transformer accountability. Exact NTK characterization of softmax attention is precluded by its …
arXiv cs.CL TIER_1 English(EN) · Bowen Peng, Subho Ghosh, Jeffrey Quesnelle · 2026-05-08 04:00

Long Context Pre-Training with Lighthouse Attention

arXiv:2605.06554v1 Announce Type: new Abstract: Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only symmetrical selection-b…
arXiv cs.AI TIER_1 English(EN) · Edo Liberty, Alexandr Andoni, Eldar Kleiner · 2026-05-08 04:00

Nearly Optimal Attention Coresets

arXiv:2605.05602v1 Announce Type: cross Abstract: We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\math…
arXiv cs.CL TIER_1 English(EN) · Jeffrey Quesnelle · 2026-05-07 16:49

Long Context Pre-Training with Lighthouse Attention

Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only symmetrical selection-based hierarchical attention algorithm that wraps…
arXiv cs.LG TIER_1 English(EN) · Xing Ma, Yangjie Zhou, Wu Sun, Zihan Liu, Jingwen Leng, Yun Lin, Shixuan Sun, Minyi Guo, Jin Song Dong · 2026-05-07 04:00

CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels

arXiv:2605.05023v1 Announce Type: new Abstract: Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-06 15:19

CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels

Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for flexibility, while expert-written kernels achie…
arXiv cs.LG TIER_1 English(EN) · Jin Song Dong · 2026-05-06 15:19

CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels

Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for flexibility, while expert-written kernels achie…
arXiv cs.LG TIER_1 English(EN) · Zehao Jin, Yanan Sui · 2026-05-06 04:00

Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention

arXiv:2604.00754v2 Announce Type: replace-cross Abstract: The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit lev…
arXiv cs.LG TIER_1 English(EN) · Wonsuk Lee · 2026-05-06 04:00

On the Invariance of Softmax Attention

arXiv:2605.02907v1 Announce Type: new Abstract: Softmax attention maps every query--key interaction into a probability distribution, but the underlying structure remains largely unexplored. We define the \emph{energy field}, the row-centered attention logit, and show that it exhi…
arXiv cs.LG TIER_1 English(EN) · Yu Zhang, Songwei Liu, Chenqian Yan, Sheng Lin, Beichen Ning, Fangmin Chen, Xing Wang · 2026-05-06 04:00

S2O: Early Stopping for Sparse Attention via Online Permutation

arXiv:2602.22575v2 Announce Type: replace Abstract: Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making fur…
arXiv cs.LG TIER_1 English(EN) · Saleh Sargolzaei · 2026-05-05 04:00

Gradient Boosting within a Single Attention Layer

arXiv:2604.03190v2 Announce Type: replace Abstract: Transformer attention computes a single softmax-weighted average over values -- a one-pass estimate that cannot correct its own errors. We introduce \emph{gradient-boosted attention}, which applies the principle of gradient boos…
arXiv cs.CL TIER_1 English(EN) · Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram · 2026-05-05 04:00

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

arXiv:2510.09883v2 Announce Type: replace Abstract: Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to…
arXiv cs.LG TIER_1 English(EN) · Satwik Bhattamishra, Kulin Shah, Michael Hahn, Varun Kanade · 2026-05-05 04:00

Provably Learning Attention with Queries

arXiv:2601.16873v2 Announce Type: replace Abstract: We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the output of the tar…
arXiv cs.LG TIER_1 English(EN) · Jaber Jaber, Osama Jaber · 2026-05-05 04:00

StreamIndex: Memory-Bounded Streaming Top-k Compressed Sparse Attention

arXiv:2605.02568v1 Announce Type: new Abstract: DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those.…
arXiv cs.LG TIER_1 English(EN) · Kyle Lee, Corentin Delacour, Kevin Callahan-Coray, Kyle Jiang, Can Yaras, Samet Oymak, Tathagata Srimani, Kerem Y. Camsari · 2026-05-05 04:00

Stochastic Sparse Attention for Memory-Bound Inference

arXiv:2605.01910v1 Announce Type: new Abstract: Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that spa…
arXiv cs.LG TIER_1 English(EN) · Osama Jaber · 2026-05-04 13:19

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S,…
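
The score, select, attend pipeline described in this abstract can be sketched for a single query as follows. This is a generic illustration, not the StreamIndex or DeepSeek implementation: `index_scores` stands in for the output of a learned indexer (assumed to be precomputed here), and `streaming_top_k` is a hypothetical helper that keeps only k candidates in memory using a min-heap.

```python
import heapq
import math

def streaming_top_k(scores, k):
    """Return the indices of the k largest scores using a size-k min-heap."""
    heap = []  # holds (score, index); the smallest kept score sits at heap[0]
    for idx, s in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (s, idx))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, idx))
    return sorted(idx for _, idx in heap)

def sparse_attention(query, keys, values, index_scores, k):
    """Softmax attention restricted to the k keys with the highest index scores."""
    selected = streaming_top_k(index_scores, k)
    d = len(query)
    logits = [
        sum(q * x for q, x in zip(query, keys[i])) / math.sqrt(d) for i in selected
    ]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    d_v = len(values[0])
    out = [
        sum(e / total * values[i][j] for e, i in zip(exps, selected))
        for j in range(d_v)
    ]
    return out, selected
```

Only the selected keys and values are read in the attention step, so the work per query scales with k rather than with the full context length.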
arXiv cs.LG TIER_1 English(EN) · Xiuying Wei, Caglar Gulcehre · 2026-05-04 04:00

RAT+: Train Dense, Infer Sparse -- Recurrence-Augmented Attention for Dilated Inference

arXiv:2602.18196v3 Announce Type: replace Abstract: Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work…
arXiv cs.LG TIER_1 English(EN) · Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim · 2026-05-04 04:00

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

arXiv:2602.03216v2 Announce Type: replace-cross Abstract: The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-04 01:57

Projection-Free Transformers via Gaussian Kernel Attention

Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a s…
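
For reference, the standard formulation quoted in this abstract can be written out as a minimal pure-Python sketch. It shows only the baseline with learned projections, not the paper's projection-free variant; the helper names `matmul`, `softmax`, and `self_attention` are chosen for this illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Compute softmax(Q K^T / sqrt(d)) V with Q = X W_Q, K = X W_K, V = X W_V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # n x n matrix: the source of the quadratic cost
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights
```

The `scores` matrix has one entry per query-key pair, so time and memory grow with the square of the sequence length.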
arXiv stat.ML TIER_1 English(EN) · Hugo Koubbi, Louis Hernandez, Matthieu Boussard · 2026-05-14 04:00

Understanding Catastrophic Forgetting in LoRA via Mean-Field Attention Dynamics

arXiv:2402.15415v2 Announce Type: replace-cross Abstract: Low-Rank Adaptation (LoRA) is the dominant parameter-efficient fine-tuning method due to its favorable compute-performance trade-off, yet it suffers from catastrophic forgetting. We study forgetting through a tractable _me…
arXiv stat.ML TIER_1 English(EN) · Tomohiro Hayase, Ryo Karakida · 2026-05-14 04:00

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

arXiv:2605.12697v1 Announce Type: new Abstract: Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\…
arXiv stat.ML TIER_1 English(EN) · Ryo Karakida · 2026-05-12 19:48

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general th…
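
To make the three candidate laws concrete, the sketch below multiplies attention logits by an inverse temperature beta(n) = (log n)^p for p in {1/2, 1, 2} and measures how sharply the resulting softmax concentrates. It takes no position on which law is correct, which is the question the paper addresses; the function names and the entropy diagnostic are assumptions made for this illustration.

```python
import math

def inverse_temperature(n, power):
    """beta(n) = (log n) ** power, with power 0.5, 1, or 2 for the three laws."""
    return math.log(n) ** power

def scaled_softmax(logits, power):
    """Softmax over logits multiplied by a length-dependent inverse temperature."""
    beta = inverse_temperature(len(logits), power)
    m = max(logits)
    exps = [math.exp(beta * (l - m)) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means more concentrated attention."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For a context long enough that log n exceeds 1, a larger exponent gives a larger beta and therefore a sharper, lower-entropy attention distribution over the same logits.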
arXiv cs.CV TIER_1 English(EN) · Xi Peng · 2026-05-12 09:56

Learning Subspace-Preserving Sparse Attention Graphs from Heterogeneous Multiview Data

The high-dimensional features extracted from large-scale unlabeled data via various pretrained models with diverse architectures are referred to as heterogeneous multiview data. Most existing unsupervised transfer learning methods fail to faithfully recover intrinsic subspace str…
arXiv stat.ML TIER_1 English(EN) · Mohamed El Amine Seddik · 2026-05-11 04:00

How Does Attention Help? Random Matrix Insights into Signal Recovery for Sequence Models

arXiv:2605.06826v1 Announce Type: new Abstract: We study the spectral properties of sample covariance matrices constructed from pooled sequence representations, where token embeddings are drawn from a fixed two-class Gaussian mixture table and pooled via (fixed) attention weights…
arXiv cs.CV TIER_1 English(EN) · Suho Yoo, Youngjoon Jang, Joon Son Chung · 2026-05-08 04:00

On the Nature of Attention Sinks That Shape Decoding Strategies in Omni-LLMs

arXiv:2603.14337v2 Announce Type: replace Abstract: The goal of this paper is to strengthen the reasoning of Omnimodal Large Language Models (Omni-LLMs) at inference time, without additional training. These models jointly process video, audio, and text, and given the large number…
arXiv stat.ML TIER_1 English(EN) · Mohamed El Amine Seddik · 2026-05-07 18:28

How Does Attention Help? Random Matrix Insights into Signal Recovery for Sequence Models

We study the spectral properties of sample covariance matrices constructed from pooled sequence representations, where token embeddings are drawn from a fixed two-class Gaussian mixture table and pooled via (fixed) attention weights. Working in the high-dimensional regime $d,V,N\…
arXiv cs.CV TIER_1 English(EN) · Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li · 2026-05-06 04:00

Test-Time Training with KV Binding Is Actually Linear Attention

arXiv:2602.21204v3 Announce Type: replace-cross Abstract: Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomen…
dev.to — LLM tag TIER_1 English(EN) · 丁久 · 2026-05-12 11:06

Attention Mechanisms in Neural Networks

This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/attention-mechanisms.html). For the full version with working code examples and related articles, visit the original post.…