Researchers explore novel attention mechanisms and optimization techniques for LLMs
By PulseAugur Editorial ·
Summary by gemini-2.5-flash-lite
from 37 sources
Researchers are exploring novel attention mechanisms to overcome the quadratic complexity of standard self-attention in transformers, particularly for long-context processing. Several papers introduce methods like Lighthouse Attention for efficient pre-training, Robust Filter Attention that frames attention as state estimation, and Stochastic Attention inspired by neural connectomes to improve expressivity. Other work focuses on optimizing attention's computational footprint through techniques like early stopping in sparse attention (S2O) and analyzing the theoretical limits of linearized attention. Additionally, a framework called CuBridge is presented for understanding and reconstructing high-performance attention kernels using LLMs.
AI
IMPACT
These advancements aim to improve the efficiency and capability of large language models, enabling them to handle longer contexts and complex computations more effectively.
RANK_REASON
Multiple arXiv papers introduce novel attention mechanisms and optimization techniques for transformers.
Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a comp…
Transformers have revolutionized machine learning and deploying attention layers in the model is increasingly standard across a myriad of applications. Further, for large models, it is common to implement Low Rank Adaptation (LoRA), whereby a factorized parameterization of them i…
Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this…
We give a novel logical characterization of encoder-decoder transformers, the foundational architecture for LLMs that also sees use in various settings that benefit from cross-attention. We study such transformers over text in the practical setting of floating-point numbers and s…
DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses …
arXiv cs.LG
TIER_1·Elad Hoffer, Yochai Blau, Ron Banner, Daniel Soudry, Boris Ginsburg·
arXiv:2605.05806v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) typically treats retrieval and generation as separate systems. We ask whether an attention-based encoder-decoder can instead retrieve directly from its own internal representations. We introduce …
arXiv:2605.05838v1 Announce Type: new Abstract: Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrence…
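The linear-recurrence view this abstract refers to can be sketched as a running outer-product state. The snippet below is a generic kernelized causal linear attention with an ELU+1 feature map, a common illustrative choice, not the specific Mamba2 or GDN parameterization:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Causal linear attention: maintain S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j),
    so each step costs O(d^2) instead of attending over all previous tokens."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # ELU(x) + 1, keeps features positive
    n, d = Q.shape
    S = np.zeros((d, d))          # running key-value outer-product state
    z = np.zeros(d)               # running normalizer state
    out = np.zeros_like(V)
    for t in range(n):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        out[t] = phi(Q[t]) @ S / (phi(Q[t]) @ z + 1e-9)
    return out

rng = np.random.default_rng(2)
n, d = 10, 8
out = linear_attention(rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)),
                       rng.standard_normal((n, d)))
print(out.shape)  # (10, 8)
```

The per-step state `(S, z)` is what recent LA models reinterpret as a linear recurrence: the cost per token is independent of sequence length.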
arXiv:2509.04154v5 Announce Type: replace Abstract: We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (…
arXiv cs.LG
TIER_1·Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou·
arXiv:2601.21351v2 Announce Type: replace Abstract: Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communi…
arXiv cs.LG
TIER_1·Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez·
arXiv:2603.13085v2 Announce Type: replace Abstract: Understanding whether attention mechanisms converge to the kernel regime is foundational to the validity of influence functions for transformer accountability. Exact NTK characterization of softmax attention is precluded by its …
arXiv:2605.06554v1 Announce Type: new Abstract: Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only symmetrical selection-based hierarchical attention algorithm that wraps…
arXiv cs.AI
TIER_1·Edo Liberty, Alexandr Andoni, Eldar Kleiner·
arXiv:2605.05602v1 Announce Type: cross Abstract: We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\math…
arXiv:2605.05023v1 Announce Type: new Abstract: Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for flexibility, while expert-written kernels achieve…
arXiv:2604.00754v2 Announce Type: replace-cross Abstract: The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit lev…
arXiv:2605.02907v1 Announce Type: new Abstract: Softmax attention maps every query--key interaction into a probability distribution, but the underlying structure remains largely unexplored. We define the \emph{energy field}, the row-centered attention logit, and show that it exhi…
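The quantity defined here, the row-centered attention logit, is straightforward to compute. A minimal NumPy sketch (with random stand-in projections, since the paper's model weights are not given):

```python
import numpy as np

def energy_field(Q, K):
    """Row-centered attention logits: subtract each query row's mean logit.
    Subtracting a per-row constant leaves the softmax unchanged."""
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    return logits - logits.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
Q = rng.standard_normal((6, 8))
K = rng.standard_normal((6, 8))
E = energy_field(Q, K)
print(np.allclose(E.mean(axis=-1), 0))  # True: every row is centered
```

Because softmax is invariant to adding a constant per row, the energy field carries exactly the same attention distribution as the raw logits while removing the arbitrary per-query offset.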
arXiv:2604.03190v2 Announce Type: replace Abstract: Transformer attention computes a single softmax-weighted average over values -- a one-pass estimate that cannot correct its own errors. We introduce \emph{gradient-boosted attention}, which applies the principle of gradient boos…
arXiv:2510.09883v2 Announce Type: replace Abstract: Large reasoning models (LRMs) achieve state-of-the-art performance on challenging benchmarks by generating long chains of intermediate steps, but their inference cost is dominated by decoding, where each new token must attend to…
arXiv cs.LG
TIER_1·Satwik Bhattamishra, Kulin Shah, Michael Hahn, Varun Kanade·
arXiv:2601.16873v2 Announce Type: replace Abstract: We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the output of the tar…
arXiv:2605.02568v1 Announce Type: new Abstract: DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S,…
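The selection step CSA describes (a cheap indexer scores every prefix token, then full attention reads only the top-k) can be illustrated with a minimal NumPy sketch. The scoring projection `W_idx` here is a random stand-in, not DeepSeek's actual indexer:

```python
import numpy as np

def topk_sparse_attention(q, K, V, W_idx, k):
    """Score all cached keys with a lightweight indexer, then run softmax
    attention only over the k highest-scoring entries."""
    scores = K @ (W_idx @ q)           # one cheap indexer score per cached token
    top = np.argsort(scores)[-k:]      # indices of the top-k tokens for this query
    K_sel, V_sel = K[top], V[top]      # the sparse kernel reads only these rows
    logits = K_sel @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V_sel

rng = np.random.default_rng(1)
d, n, k = 16, 64, 8
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
W_idx = rng.standard_normal((d, d))
out = topk_sparse_attention(q, K, V, W_idx, k)
print(out.shape)  # (16,)
```

The main attention cost drops from O(n) to O(k) reads per query; the indexer pass stays O(n) but operates on compressed, cheaper representations in the actual system.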
arXiv cs.LG
TIER_1·Kyle Lee, Corentin Delacour, Kevin Callahan-Coray, Kyle Jiang, Can Yaras, Samet Oymak, Tathagata Srimani, Kerem Y. Camsari·
arXiv:2605.01910v1 Announce Type: new Abstract: Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that spa…
arXiv:2602.18196v3 Announce Type: replace Abstract: Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work…
arXiv cs.LG
TIER_1·Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim·
arXiv:2602.03216v2 Announce Type: replace-cross Abstract: The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently…
Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a s…
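The formula quoted above can be written out in a few lines of NumPy. A minimal single-head sketch; the projection matrices are random stand-ins for learned weights:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head softmax self-attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    # Row-wise softmax with max-subtraction for numerical stability.
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d = 8, 16
X = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)  # (8, 16)
```

The question the abstract raises is whether the learned `W_Q`, `W_K`, `W_V` are necessary at all, i.e. whether fixed or structured replacements suffice.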
arXiv stat.ML
TIER_1·Hugo Koubbi, Louis Hernandez, Matthieu Boussard·
arXiv:2402.15415v2 Announce Type: replace-cross Abstract: Low-Rank Adaptation (LoRA) is the dominant parameter-efficient fine-tuning method due to its favorable compute-performance trade-off, yet it suffers from catastrophic forgetting. We study forgetting through a tractable _me…
arXiv:2605.12697v1 Announce Type: new Abstract: Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general th…
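The three competing inverse-temperature laws can be compared directly: each scales the attention logits by a different function of the context length $n$ before the softmax. A generic illustration, not the paper's proposed rule:

```python
import numpy as np

def scaled_softmax(logits, n, law="log"):
    """Apply a length-dependent inverse temperature beta(n) to attention logits."""
    beta = {"sqrt_log": np.log(n) ** 0.5,   # (log n)^{1/2}
            "log":      np.log(n),          # log n
            "log_sq":   np.log(n) ** 2}[law]  # (log n)^2
    z = beta * logits
    w = np.exp(z - z.max())  # max-subtraction for numerical stability
    return w / w.sum()

rng = np.random.default_rng(3)
n = 4096
logits = rng.standard_normal(n)
for law in ("sqrt_log", "log", "log_sq"):
    w = scaled_softmax(logits, n, law)
    print(law, float(w.max()))  # sharper laws concentrate more mass on the argmax
```

Larger inverse temperatures make the distribution more peaked, which is why the choice of law matters: too flat and attention washes out over long contexts, too sharp and it collapses onto a single token.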
The high-dimensional features extracted from large-scale unlabeled data via various pretrained models with diverse architectures are referred to as heterogeneous multiview data. Most existing unsupervised transfer learning methods fail to faithfully recover intrinsic subspace str…
arXiv:2605.06826v1 Announce Type: new Abstract: We study the spectral properties of sample covariance matrices constructed from pooled sequence representations, where token embeddings are drawn from a fixed two-class Gaussian mixture table and pooled via (fixed) attention weights. Working in the high-dimensional regime $d,V,N\…
arXiv cs.CV
TIER_1·Suho Yoo, Youngjoon Jang, Joon Son Chung·
arXiv:2603.14337v2 Announce Type: replace Abstract: The goal of this paper is to strengthen the reasoning of Omnimodal Large Language Models (Omni-LLMs) at inference time, without additional training. These models jointly process video, audio, and text, and given the large number…
arXiv cs.CV
TIER_1·Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li·
arXiv:2602.21204v3 Announce Type: replace-cross Abstract: Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomen…
This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/attention-mechanisms.html). For the full version with working code examples and related articles, visit the original post.