Researchers explore novel attention mechanisms and optimization techniques for LLMs
By PulseAugur Editorial ·
Summary by gemini-2.5-flash-lite
from 37 sources
Researchers are exploring novel attention mechanisms to overcome the quadratic complexity of standard self-attention in transformers, particularly for long-context processing. Several papers introduce methods like Lighthouse Attention for efficient pre-training, Robust Filter Attention that frames attention as state estimation, and Stochastic Attention inspired by neural connectomes to improve expressivity. Other work focuses on optimizing attention's computational footprint through techniques like early stopping in sparse attention (S2O) and analyzing the theoretical limits of linearized attention. Additionally, a framework called CuBridge is presented for understanding and reconstructing high-performance attention kernels using LLMs.
AI
IMPACT
These advancements aim to improve the efficiency and capability of large language models, enabling them to handle longer contexts and complex computations more effectively.
RANK_REASON
Multiple arXiv papers introduce novel attention mechanisms and optimization techniques for transformers.
Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a comp…
Transformers have revolutionized machine learning and deploying attention layers in the model is increasingly standard across a myriad of applications. Further, for large models, it is common to implement Low Rank Adaptation (LoRA), whereby a factorized parameterization of them i…
Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this…
We give a novel logical characterization of encoder-decoder transformers, the foundational architecture for LLMs that also sees use in various settings that benefit from cross-attention. We study such transformers over text in the practical setting of floating-point numbers and s…
DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses …
arXiv cs.LG
TIER_1·Elad Hoffer, Yochai Blau, Ron Banner, Daniel Soudry, Boris Ginsburg·
arXiv:2605.05806v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) typically treats retrieval and generation as separate systems. We ask whether an attention-based encoder-decoder can instead retrieve directly from its own internal representations. We introduce …
arXiv:2605.05838v1 Announce Type: new Abstract: Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrence…
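The linear-recurrence view this abstract refers to can be sketched as a running outer-product state. The snippet below is a generic kernelized causal linear attention with an ELU+1 feature map, a common illustrative choice, not the specific Mamba2 or GDN parameterization:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Causal linear attention: maintain S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j),
    so each step costs O(d^2) instead of attending over all previous tokens."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # ELU(x) + 1, keeps features positive
    n, d = Q.shape
    S = np.zeros((d, d))          # running key-value outer-product state
    z = np.zeros(d)               # running normalizer state
    out = np.zeros_like(V)
    for t in range(n):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        out[t] = phi(Q[t]) @ S / (phi(Q[t]) @ z + 1e-9)
    return out

rng = np.random.default_rng(2)
n, d = 10, 8
out = linear_attention(rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)))
print(out.shape)  # (10, 8)
```

The per-step state `(S, z)` is what recent LA models reinterpret as a linear recurrence: the cost per token is independent of sequence length.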
arXiv:2509.04154v5 Announce Type: replace Abstract: We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (…
arXiv cs.LG
TIER_1·Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou·
arXiv:2601.21351v2 Announce Type: replace Abstract: Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communi…
arXiv cs.LG
TIER_1·Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez·
arXiv:2603.13085v2 Announce Type: replace Abstract: Understanding whether attention mechanisms converge to the kernel regime is foundational to the validity of influence functions for transformer accountability. Exact NTK characterization of softmax attention is precluded by its …
arXiv:2605.06554v1 Announce Type: new Abstract: Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only symmetrical selection-based hierarchical attention algorithm that wraps…
arXiv cs.AI
TIER_1·Edo Liberty, Alexandr Andoni, Eldar Kleiner·
arXiv:2605.05602v1 Announce Type: cross Abstract: We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\math…
arXiv:2605.05023v1 Announce Type: new Abstract: Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for flexibility, while expert-written kernels achieve…
arXiv:2604.00754v2 Announce Type: replace-cross Abstract: The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit lev…
arXiv:2605.02907v1 Announce Type: new Abstract: Softmax attention maps every query--key interaction into a probability distribution, but the underlying structure remains largely unexplored. We define the \emph{energy field}, the row-centered attention logit, and show that it exhi…
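The quantity defined here, the row-centered attention logit, is straightforward to compute. A minimal NumPy sketch (with random stand-in projections, since the paper's model weights are not given):

```python
import numpy as np

def energy_field(Q, K):
    """Row-centered attention logits: subtract each query row's mean logit.
    Subtracting a per-row constant leaves the softmax unchanged."""
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    return logits - logits.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
Q = rng.standard_normal((6, 8))
K = rng.standard_normal((6, 8))
E = energy_field(Q, K)
print(np.allclose(E.mean(axis=-1), 0))  # True: every row is centered
```

Because softmax is invariant to adding a constant per row, the energy field carries exactly the same attention distribution as the raw logits while removing the arbitrary per-query offset.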
arXiv:2604.03190v2 Announce Type: replace Abstract: Transformer attention computes a single softmax-weighted average over values -- a one-pass estimate that cannot correct its own errors. We introduce \emph{gradient-boosted attention}, which applies the principle of gradient boos…
arXiv:2510.09883v2 Announce Type: replace Abstract: Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to…
arXiv cs.LG
TIER_1·Satwik Bhattamishra, Kulin Shah, Michael Hahn, Varun Kanade·
arXiv:2601.16873v2 Announce Type: replace Abstract: We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the output of the tar…
arXiv:2605.02568v1 Announce Type: new Abstract: DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S,…
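The selection step CSA describes (a cheap indexer scores every prefix token, then full attention reads only the top-k) can be illustrated with a minimal NumPy sketch. The scoring projection `W_idx` here is a random stand-in, not DeepSeek's actual indexer:

```python
import numpy as np

def topk_sparse_attention(q, K, V, W_idx, k):
    """Score all cached keys with a lightweight indexer, then run softmax
    attention only over the k highest-scoring entries."""
    scores = K @ (W_idx @ q)           # one cheap indexer score per cached token
    top = np.argsort(scores)[-k:]      # indices of the top-k tokens for this query
    K_sel, V_sel = K[top], V[top]      # the sparse kernel reads only these rows
    logits = K_sel @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V_sel

rng = np.random.default_rng(1)
d, n, k = 16, 64, 8
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
W_idx = rng.standard_normal((d, d))
out = topk_sparse_attention(q, K, V, W_idx, k)
print(out.shape)  # (16,)
```

The main attention cost drops from O(n) to O(k) reads per query; the indexer pass stays O(n) but operates on compressed, cheaper representations in the actual system.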
arXiv cs.LG
TIER_1·Kyle Lee, Corentin Delacour, Kevin Callahan-Coray, Kyle Jiang, Can Yaras, Samet Oymak, Tathagata Srimani, Kerem Y. Camsari·
arXiv:2605.01910v1 Announce Type: new Abstract: Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that spa…
arXiv:2602.18196v3 Announce Type: replace Abstract: Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work…
arXiv cs.LG
TIER_1·Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim·
arXiv:2602.03216v2 Announce Type: replace-cross Abstract: The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently…
Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a s…
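The formula quoted above can be written out in a few lines of NumPy. A minimal single-head sketch; the projection matrices are random stand-ins for learned weights:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head softmax self-attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    # Row-wise softmax with max-subtraction for numerical stability.
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d = 8, 16
X = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)  # (8, 16)
```

The question the abstract raises is whether the learned `W_Q`, `W_K`, `W_V` are necessary at all, i.e. whether fixed or structured replacements suffice.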
arXiv stat.ML
TIER_1·Hugo Koubbi, Louis Hernandez, Matthieu Boussard·
arXiv:2402.15415v2 Announce Type: replace-cross Abstract: Low-Rank Adaptation (LoRA) is the dominant parameter-efficient fine-tuning method due to its favorable compute-performance trade-off, yet it suffers from catastrophic forgetting. We study forgetting through a tractable _me…
arXiv:2605.12697v1 Announce Type: new Abstract: Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general th…
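The three competing inverse-temperature laws can be compared directly: each scales the attention logits by a different function of the context length $n$ before the softmax. A generic illustration, not the paper's proposed rule:

```python
import numpy as np

def scaled_softmax(logits, n, law="log"):
    """Apply a length-dependent inverse temperature beta(n) to attention logits."""
    beta = {"sqrt_log": np.log(n) ** 0.5,   # (log n)^{1/2}
            "log":      np.log(n),          # log n
            "log_sq":   np.log(n) ** 2}[law]  # (log n)^2
    z = beta * logits
    w = np.exp(z - z.max())  # max-subtraction for numerical stability
    return w / w.sum()

rng = np.random.default_rng(3)
n = 4096
logits = rng.standard_normal(n)
for law in ("sqrt_log", "log", "log_sq"):
    w = scaled_softmax(logits, n, law)
    print(law, float(w.max()))  # sharper laws concentrate more mass on the argmax
```

Larger inverse temperatures make the distribution more peaked, which is why the choice of law matters: too flat and attention washes out over long contexts, too sharp and it collapses onto a single token.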
The high-dimensional features extracted from large-scale unlabeled data via various pretrained models with diverse architectures are referred to as heterogeneous multiview data. Most existing unsupervised transfer learning methods fail to faithfully recover intrinsic subspace str…
arXiv:2605.06826v1 Announce Type: new Abstract: We study the spectral properties of sample covariance matrices constructed from pooled sequence representations, where token embeddings are drawn from a fixed two-class Gaussian mixture table and pooled via (fixed) attention weights. Working in the high-dimensional regime $d,V,N\…
arXiv cs.CV
TIER_1·Suho Yoo, Youngjoon Jang, Joon Son Chung·
arXiv:2603.14337v2 Announce Type: replace Abstract: The goal of this paper is to strengthen the reasoning of Omnimodal Large Language Models (Omni-LLMs) at inference time, without additional training. These models jointly process video, audio, and text, and given the large number…
arXiv cs.CV
TIER_1·Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li·
arXiv:2602.21204v3 Announce Type: replace-cross Abstract: Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomen…
This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/attention-mechanisms.html). For the full version with working code examples and related articles, visit the original post.