New research tackles attention mechanism limitations in transformers

arXiv cs.AI TIER_1 English(EN) · Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu, Lunbin Zeng, Xiaolong Li, Haohai Sun, Haichao Zhu, Vito Zhang, Pengyu Zhao · 2026-06-12 04:00

MiniMax Sparse Attention

arXiv:2606.13392v1 Announce Type: new Abstract: Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of t…

arXiv cs.AI TIER_1 English(EN) · Pengyu Zhao · 2026-06-11 14:23

MiniMax Sparse Attention

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attenti…

arXiv cs.AI TIER_1 English(EN) · Alejandro García-Castellanos, Maurice Weiler, Erik J Bekkers · 2026-06-11 04:00

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

arXiv:2606.11275v1 Announce Type: cross Abstract: Rotary Position Embeddings (RoPE) make attention scores position-relative but leave the value pathway position-blind: the message sent by a value token is the same regardless of its distance from the query. We propose RoVE, a para…
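
The claim that RoPE makes attention scores position-relative can be checked with a minimal sketch. It assumes a single 2-D frequency pair and a hypothetical rotation rate `theta`, and it illustrates standard RoPE scoring only, not the RoVE method itself.

```python
import math

def rotate(vec, pos, theta=0.5):
    """Rotate a 2-D vector by the angle pos * theta (one RoPE frequency pair)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, q_pos, k_pos):
    """Dot product of the position-rotated query and key."""
    qr = rotate(q, q_pos)
    kr = rotate(k, k_pos)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 2.0), (0.5, -1.0)
near = rope_score(q, k, q_pos=7, k_pos=3)
shifted = rope_score(q, k, q_pos=107, k_pos=103)
# Both pairs have the same offset (4), so the scores agree up to rounding error.
print(abs(near - shifted) < 1e-9)
```

The value vectors are not rotated in this sketch, which is the position-blind value pathway the abstract refers to.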

arXiv cs.CL TIER_1 English(EN) · Joshua Nunley · 2026-06-11 04:00

Kuramoto Attention: Synchronizing Self-Attention on the Torus

arXiv:2606.11585v1 Announce Type: cross Abstract: We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent com…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

MiniMax Sparse Attention

MiniMax Sparse Attention enables efficient processing of ultra-long contexts in large language models through blockwise sparsity and optimized GPU execution, achieving significant speedups while maintaining performance.

arXiv cs.AI TIER_1 English(EN) · Xin Wang, Hui Shen, Boyuan Zheng, Xueshen Liu, Minkyoung Cho, Zhongwei Wan, Zesen Zhao, Zhuoqing Mao, Shen Yan, Mi Zhang · 2026-06-10 04:00

Dynamic Linear Attention

arXiv:2606.10650v1 Announce Type: cross Abstract: The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To im…

arXiv cs.LG TIER_1 English(EN) · Kosti Koistinen, Kirsi Hellsten, Joni Herttuainen, Kimmo K. Kaski · 2026-06-10 04:00

Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention

arXiv:2603.10676v2 Announce Type: replace Abstract: Industrial Control Systems (ICS) underpin critical infrastructure and face growing cyber-physical threats due to the convergence of operational technology and networked environments. While machine learning-based anomaly detectio…

arXiv cs.CL TIER_1 English(EN) · Joshua Nunley · 2026-06-10 02:24

Kuramoto Attention: Synchronizing Self-Attention on the Torus

We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Be…
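
As a rough illustration of "the tangent component of the attention-weighted circular mean", the sketch below projects a weighted mean of unit vectors onto the tangent direction at a token's angle. The function name, angles, and weights are hypothetical, and the paper's gating and exact update rule are not reproduced here.

```python
import math

def circular_mean_tangent(theta_i, thetas, weights):
    """Tangent component at theta_i of the weighted circular mean vector.

    The mean vector is sum_j w_j * (cos t_j, sin t_j). Projecting it onto the unit
    tangent at theta_i, (-sin theta_i, cos theta_i), gives
    sum_j w_j * sin(t_j - theta_i), the classic Kuramoto coupling term.
    """
    mx = sum(w * math.cos(t) for w, t in zip(weights, thetas))
    my = sum(w * math.sin(t) for w, t in zip(weights, thetas))
    return -math.sin(theta_i) * mx + math.cos(theta_i) * my

thetas = [0.2, 1.0, 2.5]    # hypothetical phase states of earlier tokens
weights = [0.5, 0.3, 0.2]   # hypothetical attention weights (sum to 1)
step = circular_mean_tangent(0.7, thetas, weights)
kuramoto = sum(w * math.sin(t - 0.7) for w, t in zip(weights, thetas))
print(abs(step - kuramoto) < 1e-12)
```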

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 09:57

Dynamic Linear Attention

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts,…

arXiv cs.AI TIER_1 English(EN) · Mi Zhang · 2026-06-09 09:57

Dynamic Linear Attention

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts,…

arXiv cs.AI TIER_1 English(EN) · Hongxing Wang, Harenome Razanajato, Zhen Zhang, Yujie Yuan, Hongsheng Liu · 2026-06-09 04:00

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

arXiv:2606.07703v1 Announce Type: cross Abstract: Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, sparse, linear, or recurrent components. We study how much dense attention is needed to preserve…

arXiv cs.AI TIER_1 English(EN) · Kewei Li, Rongying Zhang, Xueli Wang, Xiwen Gong, Zhongjian Wang, Lan Huang, Ruochi Zhang, Fengfeng Zhou · 2026-06-09 04:00

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

arXiv:2606.08191v1 Announce Type: cross Abstract: Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that …

arXiv cs.LG TIER_1 English(EN) · Léa Bohbot, Cyril Letrouit, Gabriel Peyré, François-Xavier Vialard · 2026-06-09 04:00

Token Sample Complexity of Attention

arXiv:2512.10656v3 Announce Type: replace Abstract: As context windows in large language models continue to expand, it is essential to characterize how attention behaves at extreme sequence lengths. We introduce token sample complexity: the rate at which attention computed on $n$…

arXiv cs.LG TIER_1 English(EN) · Lukas Fesser, Mozes Jacobs, Thomas Fel, Andy Keller, Sham Kakade · 2026-06-09 04:00

A Unifying View of Attention Sinks: Two Algorithms, Two Solutions

arXiv:2606.08105v1 Announce Type: new Abstract: When attention concentrates on a single token, a sink, what is the model actually computing? Attention sinks are ubiquitous in softmax transformers, yet this shared visual signature can hide fundamentally different algorithms. We sh…

arXiv cs.AI TIER_1 English(EN) · Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Tao Lan, Lin Qu, Yuan Yao, Xiaoxing Ma · 2026-06-09 04:00

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

arXiv:2605.16928v2 Announce Type: replace-cross Abstract: Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating …

arXiv cs.AI TIER_1 English(EN) · Yang Liu, Dongxin Guo, Tom Zheng, Siu Ming Yiu, Liam Ning, Jikun Wu · 2026-06-09 04:00

Capacity-Controlled Global Attention for Graph Transformers

arXiv:2604.17324v2 Announce Type: replace-cross Abstract: Global self-attention drives modern graph transformers, yet the softmax at its core imposes a structural constraint rarely examined directly: every attention row is non-negative and sums to one, so each per-head output is …
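
The structural constraint named in the abstract, that every softmax attention row is non-negative and sums to one, can be verified directly. This is a minimal sketch with made-up scores, not code from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

scores = [[2.0, -1.0, 0.5], [0.0, 0.0, 3.0]]  # hypothetical raw scores
attn = [softmax(r) for r in scores]
for row in attn:
    # Each row is a probability distribution over the attended tokens.
    print(all(w >= 0 for w in row), abs(sum(row) - 1.0) < 1e-12)
```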

arXiv cs.AI TIER_1 English(EN) · Yan Wang, Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu · 2026-06-09 04:00

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

arXiv:2606.09079v1 Announce Type: cross Abstract: Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powere…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 00:00

Dynamic Linear Attention

DLA addresses limitations in long-context LLMs by introducing adaptive state merging and capacity-bounded memory modeling for improved multi-state linear attention.

arXiv cs.AI TIER_1 English(EN) · Lin Huang, Chengxiang Huang, Ziang Wang, Yiyue Du, Chu Wang, Haocheng Lu, Yunyang Li, Xiaoli Liu, Arthur Jiang, Jia Zhang · 2026-06-08 04:00

E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory

arXiv:2601.16622v2 Announce Type: replace-cross Abstract: Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream architectures face critical scalability bottlenecks due to the explicit construction of ge…

arXiv cs.AI TIER_1 English(EN) · Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State · 2026-06-08 04:00

Limitations of Normalization in Attention Mechanism

arXiv:2508.17821v3 Announce Type: replace-cross Abstract: This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation invo…

arXiv cs.LG TIER_1 English(EN) · Justin Y. Chen, Ying Feng, Piotr Indyk, Michael Kapralov, Ekaterina Kochetkova, Boris Prokhorov · 2026-06-08 04:00

Towards Tight Bounds for Streaming Attention

arXiv:2606.07205v1 Announce Type: cross Abstract: The attention mechanism is a cornerstone of modern transformer architectures. However, its expressive power comes at the cost of quadratic runtime and linear space usage. In particular, the classical transformer architecture expli…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 00:00

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Lookahead Sparse Attention with Neural Memory Indexer reduces GPU memory usage for long-context LLM inference while maintaining accuracy through proactive KV cache management and decoupled training.

arXiv cs.AI TIER_1 English(EN) · Fengfeng Zhou · 2026-06-06 14:21

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that transforms token representations with the real FFT…

arXiv cs.AI TIER_1 English(EN) · Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, Beidi Chen · 2026-06-06 04:00

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

arXiv:2606.06453v1 Announce Type: new Abstract: Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineeri…

arXiv cs.LG TIER_1 English(EN) · Boris Prokhorov · 2026-06-05 12:15

Towards Tight Bounds for Streaming Attention

The attention mechanism is a cornerstone of modern transformer architectures. However, its expressive power comes at the cost of quadratic runtime and linear space usage. In particular, the classical transformer architecture explicitly stores all previously seen input elements (t…

arXiv cs.LG TIER_1 English(EN) · O. Duranthon, F. Boncoraglio, L. Zdeborová · 2026-06-05 04:00

High-Dimensional Theory of LoRA Fine-Tuning in a Solvable Attention Model

arXiv:2606.05899v1 Announce Type: new Abstract: We develop a high-dimensional statistical theory of low-rank adaptation (LoRA) in attention models, capturing the interplay between pre-training and fine-tuning. We introduce a solvable framework in which a single-head attention lay…

arXiv cs.CL TIER_1 English(EN) · Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei · 2026-06-05 04:00

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

arXiv:2606.06467v1 Announce Type: new Abstract: Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face…

arXiv cs.LG TIER_1 English(EN) · Yaobo Zhang · 2026-06-05 04:00

PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention

arXiv:2606.05345v1 Announce Type: new Abstract: We unify RoPE's Fourier phase, Jordan-RoPE's finite jets, and ALiBi's affine recency into a single learnable relative-position space, and study which regions of this space are selected by different tasks. PJ-RoPE is a Fourier-Jet-Af…

arXiv cs.LG TIER_1 English(EN) · M. Sagitova, O. Duranthon, L. Zdeborová · 2026-06-05 04:00

Specialization of softmax attention heads: insights from the high-dimensional single-location model

arXiv:2603.03993v2 Announce Type: replace Abstract: Multi-head attention enables transformer models to represent multiple attention patterns simultaneously. Empirically, head specialization emerges in distinct stages during training, while many heads remain redundant and learn si…

arXiv cs.AI TIER_1 English(EN) · Furu Wei · 2026-06-04 17:54

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Struc…

arXiv cs.AI TIER_1 English(EN) · Beidi Chen · 2026-06-04 17:48

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and…

arXiv cs.LG TIER_1 English(EN) · Yilun Kuang, Noah Amsel, Sanae Lotfi, Shikai Qiu, Andres Potapczynski, Andrew Gordon Wilson · 2026-06-04 04:00

Customizing the Inductive Biases of Softmax Attention using Structured Matrices

arXiv:2509.07963v2 Announce Type: replace Abstract: The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it caus…

arXiv cs.LG TIER_1 English(EN) · Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah · 2026-06-04 04:00

Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning

arXiv:2606.04434v1 Announce Type: cross Abstract: Multimodal In-Context Learning (ICL) has emerged as a practical inference paradigm for Multimodal Large Language Models, where a small set of interleaved image-text In-Context Demonstrations (ICDs) conditions the model to solve ne…

arXiv cs.CL TIER_1 English(EN) · Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun · 2026-06-04 04:00

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

arXiv:2511.20102v3 Announce Type: replace Abstract: Sparse attention reduces the quadratic complexity of full self-attention but faces two challenges: (1) an attention gap, where applying sparse attention to full-attention-trained models causes performance degradation due to trai…

arXiv cs.CL TIER_1 English(EN) · Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa · 2026-06-04 04:00

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv:2606.04511v1 Announce Type: new Abstract: Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfe…

arXiv cs.CL TIER_1 English(EN) · Oreste Villa · 2026-06-03 06:42

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itse…

arXiv cs.CL TIER_1 English(EN) · Jiaqian Li, Yanshu Li, Ligong Han, Ruixiang Tang, Wenya Wang · 2026-06-03 04:00

Train Once, Reuse Everywhere: Generalizable Implicit In-Context Learning by Routing Attention

arXiv:2509.22854v2 Announce Type: replace Abstract: Implicit in-context learning (ICL) has newly emerged as a promising paradigm that simulates ICL behaviors in the representation space of large language models (LLMs), aiming to attain few-shot performance at zero-shot cost. Howe…

arXiv cs.AI TIER_1 English(EN) · Xiaojie Xia, Huigang Zhang, Chaoliang Zhong, Jun Sun, Yusuke Oishi · 2026-06-03 04:00

Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction

arXiv:2601.11667v2 Announce Type: replace-cross Abstract: Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms …

arXiv cs.CL TIER_1 English(EN) · Difan Deng, Andreas Bentzen Winje, Lukas Fehring, Marius Lindauer · 2026-06-03 04:00

Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models

arXiv:2602.03681v2 Announce Type: replace Abstract: The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential mod…

arXiv cs.LG TIER_1 English(EN) · Zhibo Yang · 2026-06-03 04:00

Locality Does Not Imply Reachability: Boundary Repair in Block-Sparse Causal Attention

arXiv:2606.02680v1 Announce Type: new Abstract: Sparse causal attention is usually described by sequence locality: nearby tokens should remain easy to access, while distant tokens may be dropped to reduce cost. This paper studies a mismatch between sequence locality and attention…

arXiv cs.AI TIER_1 English(EN) · Younjoo Lee, Seungkyun Dan, Junghoo Lee, Jaiyoung Park, Jung Ho Ahn · 2026-06-02 04:00

DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

arXiv:2603.08026v2 Announce Type: replace-cross Abstract: Masked diffusion language models enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally e…

arXiv cs.AI TIER_1 English(EN) · Guoqiang Zhang · 2026-06-02 04:00

Improved Belief-Attention in Vision Task

arXiv:2606.00077v1 Announce Type: cross Abstract: Recently, Belief-Attention \cite{Guoqiang25BeliefAttention} has been proposed by first performing an orthogonal projection of the softmax-based weighted summation of $V$ vectors with respect to the original $V$ vectors and then ta…

arXiv cs.AI TIER_1 English(EN) · Bole Ma, Jan Eitzinger, Harald Köstler, Gerhard Wellein · 2026-06-02 04:00

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

arXiv:2606.01502v1 Announce Type: cross Abstract: Frontier LLMs increasingly decide what a query attends to with a sparse-attention indexer that picks a few KV-cache blocks per query: attention's unit is now a small, reusable chunk. Agentic workloads hammer it: many sub-agents qu…

arXiv cs.AI TIER_1 English(EN) · Hanze Li, Yaosong Du, Zhibo Yao, Mengyao Zeng, Xiuqi Ge, Xiande Huang · 2026-06-02 04:00

Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

arXiv:2503.06473v5 Announce Type: replace-cross Abstract: Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer…

arXiv cs.AI TIER_1 English(EN) · Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Skhikerujah, Emre Neftci · 2026-06-02 04:00

Learning to Remember, Learn, and Forget in Attention-Based Models

arXiv:2602.09075v3 Announce Type: replace-cross Abstract: In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory…

arXiv cs.CL TIER_1 English(EN) · Dong Le, Thong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu · 2026-06-02 04:00

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

arXiv:2606.01294v1 Announce Type: new Abstract: Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memor…
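
The "recurrent fast-weight state" of linear attention can be sketched as follows. This is the generic unnormalized formulation with no feature map, written under that assumption. It does not implement the paper's curvature-conditioned query, and all names and inputs are illustrative.

```python
def linear_attention(queries, keys, values):
    """Unnormalized causal linear attention via a running state S = sum_t k_t v_t^T."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed size, independent of sequence length
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Write side: rank-one update of the state with the new key/value pair.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read side: project the state with the current query.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs

def quadratic_reference(queries, keys, values):
    """Same result computed the O(n^2) way: sum over past tokens of (q . k_s) * v_s."""
    out = []
    for t, q in enumerate(queries):
        row = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q, keys[s]))
            row = [r + score * x for r, x in zip(row, values[s])]
        out.append(row)
    return out

Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
K = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.0]]
V = [[1.0], [2.0], [3.0]]
out = linear_attention(Q, K, V)
ref = quadratic_reference(Q, K, V)
print(all(abs(a - b) < 1e-9 for ro, rr in zip(out, ref) for a, b in zip(ro, rr)))
```

The recurrent form touches each token once and keeps only a `d_k` by `d_v` state, which is where the cost saving over the quadratic reference comes from.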

arXiv cs.CL TIER_1 English(EN) · Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen, Tianchen Huang, Zhenhua An, Zetao Chang, Xiayu Sun, Yuheng Min · 2026-06-02 04:00

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

arXiv:2606.01923v1 Announce Type: new Abstract: Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies…

arXiv cs.AI TIER_1 English(EN) · Soohyeong Shin, Yeongwook Yang · 2026-06-02 04:00

Forget Attention: Importance-Aware Attention Is All You Need

arXiv:2606.02332v1 Announce Type: new Abstract: Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters b…

arXiv cs.AI TIER_1 English(EN) · Yeongwook Yang · 2026-06-01 14:42

Forget Attention: Importance-Aware Attention Is All You Need

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (bl…

arXiv cs.CL TIER_1 English(EN) · Yuheng Min · 2026-06-01 08:57

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron a…

arXiv cs.CL TIER_1 English(EN) · Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai, Bin Yu, Weiran Huang, Kai Wang, Yue Wang · 2026-06-01 04:00

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

arXiv:2605.30912v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions ju…

arXiv cs.LG TIER_1 English(EN) · Jiefang Xiao, Maolin Gao, Simon Weber, Guandao Yang, Daniel Cremers · 2026-06-01 04:00

Functional Attention: From Pairwise Affinities to Functional Correspondences

arXiv:2605.31559v1 Announce Type: new Abstract: Learning mappings between infinite-dimensional function spaces, or operator learning, is essential for many machine learning applications. Although transformer-based operators are popular, they often rely on token-wise attention. Th…

arXiv cs.LG TIER_1 English(EN) · Wanli Zhong, Haibo Feng, Zirui Zhou, Hanyang Peng, Shiqi Yu · 2026-06-01 04:00

IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

arXiv:2511.21513v2 Announce Type: replace Abstract: Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax-related path as the dominant bottl…

arXiv cs.LG TIER_1 English(EN) · Daniel Cremers · 2026-05-29 17:22

Functional Attention: From Pairwise Affinities to Functional Correspondences

Learning mappings between infinite-dimensional function spaces, or operator learning, is essential for many machine learning applications. Although transformer-based operators are popular, they often rely on token-wise attention. These methods treat continuous fields as discrete …

arXiv cs.CL TIER_1 English(EN) · Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho · 2026-05-29 04:00

Long-Context Modeling with Dynamic Hierarchical Sparse Attention for Memory-Constrained LLM Inference

arXiv:2510.24606v2 Announce Type: replace Abstract: The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-depend…

arXiv cs.LG TIER_1 English(EN) · Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis · 2026-05-29 04:00

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

arXiv:2605.15422v2 Announce Type: replace Abstract: Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both f…
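
A back-of-the-envelope count shows why replicating the prompt is costly. The sizes below are hypothetical, and the "shared" figure simply assumes the prompt is processed once. It is an illustration of the token arithmetic, not a description of the DualKV kernel.

```python
def tokens_processed(n_responses, prompt_len, response_len, share_prompt):
    """Total tokens fed through attention for one prompt and its sampled responses."""
    if share_prompt:
        # Prompt counted once, plus every response.
        return prompt_len + n_responses * response_len
    # Prompt replicated in front of each response.
    return n_responses * (prompt_len + response_len)

N, P, R = 16, 8000, 1000  # hypothetical rollout count, prompt length, response length
replicated = tokens_processed(N, P, R, share_prompt=False)
shared = tokens_processed(N, P, R, share_prompt=True)
print(replicated, shared)
```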

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

Functional Attention: From Pairwise Affinities to Functional Correspondences

Functional Attention reinterprets attention as functional correspondence between adaptive bases, enabling compact and resolution-invariant operator learning for PDE solving and 3D segmentation.

arXiv cs.AI TIER_1 English(EN) · Gabriel Franco, Carson Loughridge, Mark Crovella · 2026-05-28 04:00

Singular Vectors of Attention Heads Align with Features

arXiv:2602.13524v2 Announce Type: replace-cross Abstract: Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made the observation that feature representations can be inferred in some cases from sin…

arXiv cs.AI TIER_1 English(EN) · Ziyue Zhao, Qining Qi, Jianfa Ma · 2026-05-28 04:00

Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism

arXiv:2503.04863v2 Announce Type: replace-cross Abstract: Compared with voxel-based grid prediction, in the field of 3D semantic occupation prediction for autonomous driving, GaussianFormer proposed using 3D Gaussian to describe scenes with sparse 3D semantic Gaussian based on ob…

arXiv cs.CL TIER_1 English(EN) · Keqi Deng, Shaoshi Ling, Ruchao Fan, Jinyu Li · 2026-05-28 04:00

UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

arXiv:2605.27740v1 Announce Type: new Abstract: Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but acc…
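
Top-k sparse attention for a single query can be sketched as below. This is a generic illustration with made-up inputs, not the UNIQUE method. It scores every key exactly for clarity, whereas practical systems would need a cheaper selection step to avoid reading the full KV cache.

```python
import math

def topk_sparse_attention(q, keys, values, k):
    """Attend to only the k highest-scoring keys for a single query vector."""
    scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax restricted to the kept positions.
    m = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(exps.values())
    d_v = len(values[0])
    out = [sum(exps[i] / total * values[i][j] for i in keep) for j in range(d_v)]
    return out, sorted(keep)

q = [1.0, 0.0]
keys = [[3.0, 0.0], [0.0, 5.0], [1.0, 1.0], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
out, kept = topk_sparse_attention(q, keys, values, k=2)
print(kept)  # indices of the value rows that were actually read
```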

arXiv cs.AI TIER_1 English(EN) · Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare, Sudarshan Srinivasan, Suvinay Subramanian, Souvik Kundu, Madhu Kumar, Midhilesh Elavazhagan, William Won, Amir Yazdanbakhsh, Tushar Krishna · 2026-05-28 04:00

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

arXiv:2605.28302v1 Announce Type: cross Abstract: Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggre…

arXiv cs.LG TIER_1 English(EN) · Adam Zweiger, Xinghong Fu, Han Guo, Yoon Kim · 2026-05-28 04:00

Fast KV Compaction via Attention Matching

arXiv:2602.16284v2 Announce Type: replace Abstract: Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summ…

arXiv cs.LG TIER_1 English(EN) · Xiuying Wei, Caglar Gulcehre · 2026-05-28 04:00

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

arXiv:2605.28640v1 Announce Type: new Abstract: Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilate…

arXiv cs.LG TIER_1 English(EN) · Caglar Gulcehre · 2026-05-27 15:46

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we…

arXiv cs.LG TIER_1 English(EN) · Pedro Henrique da Costa Avelar, Anderson R. Tavares, Luís C. Lamb · 2026-05-27 04:00

Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

arXiv:2605.27144v1 Announce Type: cross Abstract: Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduc…

arXiv cs.CL TIER_1 English(EN) · Anay Chauhan, Gurucharan Marthi Krishna Kumar, Arion Das, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das · 2026-05-27 04:00

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

arXiv:2605.18856v2 Announce Type: replace-cross Abstract: Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing…

arXiv cs.AI TIER_1 English(EN) · Dongyun Zou, Zhuoyang Zhang, Junyu Chen, Wenkun He, Qinhe Peng, Hanrong Ye, Yao Lu, Hongxu Yin, Yu Wang, Song Han, Han Cai · 2026-05-27 04:00

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

arXiv:2605.26636v1 Announce Type: cross Abstract: We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficien…

arXiv cs.AI TIER_1 English(EN) · Yujie Chen, Tailai Chen, Yifeng Gao, Zoe Wanying He, Yijue Xu, Shaobo Wang, Linfeng Zhang · 2026-05-27 04:00

Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

arXiv:2604.18103v2 Announce Type: replace Abstract: Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuris…

arXiv cs.CL TIER_1 English(EN) · Athanasios Zeris · 2026-05-27 04:00

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

arXiv:2605.26355v1 Announce Type: cross Abstract: Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary i…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 00:00

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets.

arXiv cs.LG TIER_1 English(EN) · Luís C. Lamb · 2026-05-26 15:09

Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduced new paradigms in self-attentional models, surpa…

arXiv cs.LG TIER_1 English(EN) · Weikang Meng, Yadan Luo, Liangyu Huo, Yingjian Li, Yaowei Wang, Xin Li, Zheng Zhang · 2026-05-26 04:00

Norm×Direction: Restoring the Missing Query Norm in Vision Linear Attention

arXiv:2506.21137v3 Announce Type: replace Abstract: Linear attention mitigates the quadratic complexity of softmax attention but suffers from a critical loss of expressiveness. We identify two primary causes: (1) The normalization operation cancels the query norm, which breaks th…

arXiv cs.AI TIER_1 English(EN) · Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu · 2026-05-26 04:00

Prism: Spectral-Aware Block-Sparse Attention

arXiv:2602.08426v2 Announce Type: replace-cross Abstract: Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for…

arXiv cs.AI TIER_1 English(EN) · Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf · 2026-05-26 04:00

Intrinsically Interpretable Attention via Sparse Post-Training

arXiv:2512.05865v5 Announce Type: replace-cross Abstract: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B…

arXiv cs.AI TIER_1 English(EN) · Spandan Pratyush · 2026-05-26 04:00

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

arXiv:2605.24518v1 Announce Type: cross Abstract: The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. For this approach, there has been significant researc…

arXiv cs.AI TIER_1 English(EN) · Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica · 2026-05-26 04:00

vAttention: Verified Sparse Attention

arXiv:2510.05688v2 Announce Type: replace-cross Abstract: State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these appr…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-25 22:04

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: ene…

arXiv cs.AI TIER_1 English(EN) · Xinyu Chen, Yuyi Qian, Jiang Lin, Shenyi Wang, Gao Wang, Zhiqiu Zhang, Jizhi Zhang, Mingjie Wang, Qiang Tang, Qian Wang, Song Wu, Zili Yi · 2026-05-25 04:00

SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

arXiv:2605.23245v1 Announce Type: cross Abstract: Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering…

arXiv cs.AI TIER_1 English(EN) · Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu · 2026-05-25 04:00

Sparser Block-Sparse Attention via Token Permutation

arXiv:2510.21270v2 Announce Type: replace-cross Abstract: Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respec…

arXiv cs.AI TIER_1 English(EN) · Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim · 2026-05-22 04:00

Retrospective Sparse Attention for Efficient Long-Context Generation

arXiv:2508.09001v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cach…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-16 00:00

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

RTPurbo leverages intrinsic sparsity in full-attention LLMs to achieve efficient long-context inference with minimal training overhead, enabling significant speedups while maintaining near-lossless accuracy.

arXiv cs.CV TIER_1 English(EN) · Tsz Lok Ip, Han Zhang, Lok Ming Lui · 2026-06-12 04:00

Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

arXiv:2606.12869v1 Announce Type: new Abstract: In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly i…

arXiv cs.CV TIER_1 English(EN) · Lok Ming Lui · 2026-06-11 03:56

Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly in practice and are often concentrated in localiz…

arXiv cs.CV TIER_1 English(EN) · Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Tianlei Pang, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar · 2026-06-10 04:00

FG-Attn: Leveraging Fine-Grained Sparse Attention in Video Diffusion Models

arXiv:2509.16518v2 Announce Type: replace Abstract: Using diffusion transformers for media generation may require evaluating attention over extremely long sequences, with attention layers accounting for the majority of generation latency. Exploiting sparsity in attention maps off…

arXiv cs.CV TIER_1 English(EN) · Jie Ma, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji · 2026-06-09 04:00

Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs

arXiv:2606.08511v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computational cost of self-attention over long visual token sequences. However, we identify a critical inefficiency in current arc…

arXiv stat.ML TIER_1 English(EN) · Kabir Murjani · 2026-06-05 04:00

Zero-Copy Semantic Contagion: An In-Memory Streaming Architecture for Evolving Attention Graphs

arXiv:2606.05733v1 Announce Type: cross Abstract: Per-ticker forecasting models dominate financial time-series work yet remain blind to cross-company propagation: a foundry disruption in Taiwan does not register in a single-asset model until Apple's own price has already moved. T…

arXiv stat.ML TIER_1 English(EN) · Kabir Murjani · 2026-06-04 05:48

Zero-Copy Semantic Contagion: An In-Memory Streaming Architecture for Evolving Attention Graphs

Per-ticker forecasting models dominate financial time-series work yet remain blind to cross-company propagation: a foundry disruption in Taiwan does not register in a single-asset model until Apple's own price has already moved. To address this limitation, we introduce a heteroge…

arXiv cs.CV TIER_1 English(EN) · Fatemeh Afghah · 2026-06-03 04:32

Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning

Multimodal In-Context Learning (ICL) has emerged as a practical inference paradigm for Multimodal Large Language Models, where a small set of interleaved image-text In-Context Demonstrations (ICDs) conditions the model to solve new tasks. Despite its flexibility, multimodal ICL i…

arXiv cs.CV TIER_1 English(EN) · Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao · 2026-06-03 04:00

Attend to Anything: Foundation Model for Unified Human Attention Modeling

arXiv:2606.03540v1 Announce Type: new Abstract: Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remai…

X — Together (inference / OSS) TIER_1 English(EN) · togethercompute · 2026-06-02 19:42

RT @ywangfirstlean: First technical Deepdive on M3 on the internet😎

X — Together (inference / OSS) TIER_1 English(EN) · togethercompute · 2026-06-02 19:38

MiniMax-M3 combines 1M context, native multimodality, and MiniMax Sparse Attention.

MiniMax-M3 combines 1M context, native multimodality, and MiniMax Sparse Attention. The next layer is serving it efficiently: KV-block-major sparse attention, paged MSA decode, optimized index scoring, and multimodal preprocessing before the GPU worker. Together’s Inference and

arXiv cs.CV TIER_1 English(EN) · Qijun Zhao · 2026-06-02 12:00

Attend to Anything: Foundation Model for Unified Human Attention Modeling

Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to …

arXiv stat.ML TIER_1 English(EN) · Chungpa Lee, Jy-yong Sohn, Kangwook Lee · 2026-06-02 04:00

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

arXiv:2602.23197v2 Announce Type: replace-cross Abstract: Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot pe…

arXiv stat.ML TIER_1 English(EN) · Tobias Schröder, Lester Mackey · 2026-06-02 04:00

WildCat: Near-Linear Attention in Theory and Practice

arXiv:2602.10056v2 Announce Type: replace-cross Abstract: We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy du…

arXiv cs.CV TIER_1 English(EN) · David Hagerman, Roman Naeem, Jakob Lindqvist, Carl Lindström, Fredrik Kahl, Lennart Svensson · 2026-05-29 04:00

SwInception -- Local Attention Meets Convolutions

arXiv:2605.29954v1 Announce Type: new Abstract: Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and yields excellent performance f…

arXiv cs.CV TIER_1 English(EN) · Krishna Kumar Sharma, Somdyuti Paul · 2026-05-29 04:00

Accelerating HEVC Intra Partitioning via a CNN-Hierarchical Attention Transformer Hybrid

arXiv:2605.29063v1 Announce Type: cross Abstract: The recursive quad-tree partitioning in High Efficiency Video Coding (HEVC) incurs considerable computational overhead, with exhaustive rate-distortion optimization for CTU partition prediction consuming the dominant share of enco…

arXiv cs.CV TIER_1 English(EN) · Lennart Svensson · 2026-05-28 14:00

SwInception -- Local Attention Meets Convolutions

Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and yields excellent performance for many tasks but still tends to overfit on smal…

arXiv cs.CV TIER_1 English(EN) · Xi Leng, Xinhong Ma, Ziqiang Dong, Feng Zhang, Xiaoying Tang, Yang Yang, Guanjun Jiang · 2026-05-27 04:00

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

arXiv:2605.18359v2 Announce Type: replace Abstract: Large multimodal models (LMMs) inherit the self-attention mechanism of pretrained language backbones, yet standard attention can exhibit suboptimal allocation, including cross-modal misallocation between textual and visual evide…

arXiv cs.CV TIER_1 English(EN) · Han Cai · 2026-05-26 07:17

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our a…

arXiv cs.CV TIER_1 English(EN) · Rishabh Sabharwal, Ram Samarth B B, Parikshit Singh Rathore, Punit Rathore · 2026-05-26 04:00

STEAM: Squeeze and Transform Enhanced Attention Module

arXiv:2412.09023v2 Announce Type: replace Abstract: Channel and spatial attention mechanisms introduced by earlier works enhance the representation abilities of deep convolutional neural networks (CNNs) but often lead to increased parameter and computation costs. While recent app…

arXiv cs.CV TIER_1 English(EN) · Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, Wenya Wang · 2026-05-26 04:00

Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

arXiv:2602.04789v2 Announce Type: replace Abstract: Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attenti…

arXiv cs.CV TIER_1 English(EN) · Jie Hu, Zixiang Gao, Yutong He, Kun Yuan · 2026-05-25 04:00

DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation

arXiv:2605.23445v1 Announce Type: new Abstract: Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Blo…

arXiv cs.CV TIER_1 English(EN) · Kun Yuan · 2026-05-22 09:58

DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation

Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to miti…

arXiv cs.CV TIER_1 English(EN) · Zili Yi · 2026-05-22 05:28

SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering or resource-intensive retraining, restricting the…

X — MiniMax AI TIER_1 English(EN) · MiniMax_AI · 2026-05-26 23:08

RT @eliebakouch: new minimax sparse attention compared to deepseek v3.2 (DSA) and v4 (CSA)

RT @eliebakouch: new minimax sparse attention compared to deepseek v3.2 (DSA) and v4 (CSA) main changes: - based on GQA not MLA - block le…

MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-06-01 04:36

Parallax: A Parameterized Local Linear Attention That Keeps Softmax and Adds a Learned Covariance Correction Branch

Parallax replaces LLA's per-query solver with a learned projector, doubling arithmetic intensity and improving perplexity at 0.6B and 1.7B.

Medium — MLOps tag TIER_1 English(EN) · The_Turingetic_Guy · 2026-05-31 19:06

Large-Scale Distributed LLM Inference — Part 2 : Modern Attention Mechanisms That Make Large-Scale…

https://medium.com/@the_turingetic_guy/large-scale-distributed-llm-inference-part-2-modern-attention-mechanisms-that-make-large-scale-09ba8d6581e1?source=rss------mlops-5

dev.to — LLM tag TIER_1 English(EN) · pueding · 2026-06-12 11:29

MiniMax M3 Ships Open-Weight 1M Context: MiniMax Sparse Attention (MSA)

What: The MiniMax M3 release, an open-weight model with a 1M-token context and 59% on SWE-Bench Pro, is built on MiniMax Sparse Attention (MSA), a block-sparse attention that gather…

r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-06-10 16:30

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

https://www.reddit.com/r/LocalLLaMA/comments/1u277fg/flashmemorydeepseekv4_lightning_index_ultralong/

dev.to — LLM tag TIER_1 English(EN) · Tech_Nuggets · 2026-06-10 11:20

Flash Attention: what it does and why it matters

Your training job is paying for an A100 at $3/hour. The loss is going down, gradients are flowing, and the model's loss curve looks textbook-logarithmic. But if you profile the step time and look at what the GPU is ac…

dev.to — LLM tag TIER_1 English(EN) · Tech_Nuggets · 2026-06-10 09:58

Flash Attention: what it does and why it matters

You have a single H100 with 80 GB of VRAM. The Llama 3.1 70B model fits, barely, at 140 GB in FP16, so you're running at 4-bit quantization and have maybe 5–8 GB of KV cache space left for a long-context workload. Th…

r/LocalLLaMA TIER_1 English(EN) · /u/incarnadine72 · 2026-06-03 21:35

Inference optimization for MiniMax Sparse Attention

https://www.reddit.com/r/LocalLLaMA/comments/1tw3hhj/inference_optimization_for_minimax_sparse/

dev.to — LLM tag TIER_1 English(EN) · Atlas Cloud · 2026-05-29 05:54

MiniMax Goes Sparse: Decoding M3's Attention from a Single Diagram

On May 26, MiniMax R&D lead Skyler Miao posted a diagram on X: restrained palette, but a lot of information packed in. The title reads MiniMax Sparse Attention, and the two curves on the right give an eye-catching pair of numbers: 9.7× prefill and 15.6× d…

r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-05-25 15:03

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesira…

Mastodon — mastodon.social TIER_1 English(EN) · [email protected] · 2026-06-01 04:52

Parallax introduces a parameterised local linear attention mechanism that preserves softmax while incorporating a learned covariance correction branch. Develope

Parallax introduces a parameterised local linear attention mechanism that preserves softmax while incorporating a learned covariance correction branch. Developed by researchers from Northwestern, Tilde Research and UW, the approach achieves up to 1.54× speedup over FlashAttention…
