New research addresses limitations of the attention mechanism in Transformers

arXiv cs.AI TIER_1 English(EN) · Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu, Lunbin Zeng, Xiaolong Li, Haohai Sun, Haichao Zhu, Vito Zhang, Pengyu Zhao · 2026-06-12 04:00

MiniMax Sparse Attention

arXiv:2606.13392v1 Announce Type: new Abstract: Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of t…

arXiv cs.AI TIER_1 English(EN) · Pengyu Zhao · 2026-06-11 14:23

MiniMax Sparse Attention

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attenti…

arXiv cs.AI TIER_1 English(EN) · Alejandro García-Castellanos, Maurice Weiler, Erik J Bekkers · 2026-06-11 04:00

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

arXiv:2606.11275v1 Announce Type: cross Abstract: Rotary Position Embeddings (RoPE) make attention scores position-relative but leave the value pathway position-blind: the message sent by a value token is the same regardless of its distance from the query. We propose RoVE, a para…
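
The first claim in this abstract, that RoPE makes attention scores depend only on relative position, can be checked with a small sketch. This is not RoVE itself: the 2-D vectors, the single frequency `theta`, and the helper names `rotate` and `rope_score` are illustrative assumptions.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def rope_score(q, k, q_pos, k_pos, theta=0.1):
    """Dot product of a query and a key after rotating each by position * theta.

    Only queries and keys are rotated; in plain RoPE the value vectors are
    left untouched, which is the position-blind value pathway the abstract
    refers to.
    """
    qr = rotate(q, q_pos * theta)
    kr = rotate(k, k_pos * theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 2.0), (0.5, -1.5)  # hypothetical query and key
near = rope_score(q, k, q_pos=3, k_pos=1)       # offset 2
far = rope_score(q, k, q_pos=103, k_pos=101)    # offset 2, shifted by 100
shifted = rope_score(q, k, q_pos=3, k_pos=2)    # offset 1
```

Here `near` and `far` agree up to floating-point error because both pairs are two positions apart, while `shifted` differs because the offset changed.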

arXiv cs.CL TIER_1 English(EN) · Joshua Nunley · 2026-06-11 04:00

Kuramoto Attention: Synchronizing Self-Attention on the Torus

arXiv:2606.11585v1 Announce Type: cross Abstract: We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent com…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

MiniMax Sparse Attention

MiniMax Sparse Attention enables efficient processing of ultra-long contexts in large language models through blockwise sparsity and optimized GPU execution, achieving significant speedups while maintaining performance.

arXiv cs.AI TIER_1 English(EN) · Xin Wang, Hui Shen, Boyuan Zheng, Xueshen Liu, Minkyoung Cho, Zhongwei Wan, Zesen Zhao, Zhuoqing Mao, Shen Yan, Mi Zhang · 2026-06-10 04:00

Dynamic Linear Attention

arXiv:2606.10650v1 Announce Type: cross Abstract: The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To im…

arXiv cs.LG TIER_1 English(EN) · Kosti Koistinen, Kirsi Hellsten, Joni Herttuainen, Kimmo K. Kaski · 2026-06-10 04:00

Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention

arXiv:2603.10676v2 Announce Type: replace Abstract: Industrial Control Systems (ICS) underpin critical infrastructure and face growing cyber-physical threats due to the convergence of operational technology and networked environments. While machine learning-based anomaly detectio…

arXiv cs.CL TIER_1 English(EN) · Joshua Nunley · 2026-06-10 02:24

Kuramoto Attention: Synchronizing Self-Attention on the Torus

We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Be…
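
The three steps named in this abstract can be sketched roughly as follows. This is a reading of the abstract, not the author's implementation: the softmax over scores, the fixed per-coordinate `gates`, the `step` size, and the function name `kuramoto_attention_step` are all assumptions.

```python
import math

def kuramoto_attention_step(phases, gates, step=0.5):
    """One causal update over tokens whose hidden coordinates are angles.

    phases: list of tokens, each a list of angles in radians.
    gates:  per-coordinate weights for the cosine similarity (assumed form).
    step:   update step size (assumed; not taken from the paper).
    """
    updated = []
    for i, theta_i in enumerate(phases):
        past = phases[:i]  # attend only over previous phase states
        if not past:
            updated.append(list(theta_i))  # first token has nothing to attend to
            continue
        # Gated cosine similarity between token i and each earlier token.
        scores = [
            sum(g * math.cos(a - b) for g, a, b in zip(gates, theta_i, theta_j))
            for theta_j in past
        ]
        # Softmax over the scores (an assumption about how weights are formed).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        new_theta = []
        for d, a in enumerate(theta_i):
            # Attention-weighted circular mean, kept as a 2-D vector (cx, cy).
            cx = sum(w * math.cos(t[d]) for w, t in zip(weights, past))
            cy = sum(w * math.sin(t[d]) for w, t in zip(weights, past))
            # Tangent component at angle a: projection onto (-sin a, cos a).
            tangent = -math.sin(a) * cx + math.cos(a) * cy
            new_theta.append((a + step * tangent) % (2 * math.pi))
        updated.append(new_theta)
    return updated

tokens = [[0.0, 1.0], [0.5, 1.5]]  # two hypothetical tokens, two angles each
out = kuramoto_attention_step(tokens, gates=[1.0, 1.0])
```

The tangent projection equals the weighted sum of sin(theta_j - theta_i), which is the coupling term of the classical Kuramoto model, so each update pulls a token's angles toward those of the tokens it attends to.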

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 09:57

Dynamic Linear Attention

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts,…

arXiv cs.AI TIER_1 English(EN) · Mi Zhang · 2026-06-09 09:57

Dynamic Linear Attention

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts,…

arXiv cs.AI TIER_1 English(EN) · Hongxing Wang, Harenome Razanajato, Zhen Zhang, Yujie Yuan, Hongsheng Liu · 2026-06-09 04:00

How Much Dense Attention Is Needed? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

arXiv:2606.07703v1 Announce Type: cross Abstract: Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, sparse, linear, or recurrent components. We study how much dense attention is needed to preserve…

arXiv cs.AI TIER_1 English(EN) · Kewei Li, Rongying Zhang, Xueli Wang, Xiwen Gong, Zhongjian Wang, Lan Huang, Ruochi Zhang, Fengfeng Zhou · 2026-06-09 04:00

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

arXiv:2606.08191v1 Announce Type: cross Abstract: Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that …

arXiv cs.LG TIER_1 English(EN) · Léa Bohbot, Cyril Letrouit, Gabriel Peyré, François-Xavier Vialard · 2026-06-09 04:00

Token Sample Complexity of Attention

arXiv:2512.10656v3 Announce Type: replace Abstract: As context windows in large language models continue to expand, it is essential to characterize how attention behaves at extreme sequence lengths. We introduce token sample complexity: the rate at which attention computed on $n$…

arXiv cs.LG TIER_1 English(EN) · Lukas Fesser, Mozes Jacobs, Thomas Fel, Andy Keller, Sham Kakade · 2026-06-09 04:00

A Unified View of Attention Sinks: Two Algorithms, Two Solutions

arXiv:2606.08105v1 Announce Type: new Abstract: When attention concentrates on a single token, a sink, what is the model actually computing? Attention sinks are ubiquitous in softmax transformers, yet this shared visual signature can hide fundamentally different algorithms. We sh…

arXiv cs.AI TIER_1 English(EN) · Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Tao Lan, Lin Qu, Yuan Yao, Xiaoxing Ma · 2026-06-09 04:00

Full Attention Strikes Back: Transferring Full Attention to Sparse Within a Hundred Training Steps

arXiv:2605.16928v2 Announce Type: replace-cross Abstract: Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating …

arXiv cs.AI TIER_1 English(EN) · Yang Liu, Dongxin Guo, Tom Zheng, Siu Ming Yiu, Liam Ning, Jikun Wu · 2026-06-09 04:00

Capacity-Controlled Global Attention for Graph Transformers

arXiv:2604.17324v2 Announce Type: replace-cross Abstract: Global self-attention drives modern graph transformers, yet the softmax at its core imposes a structural constraint rarely examined directly: every attention row is non-negative and sums to one, so each per-head output is …

arXiv cs.AI TIER_1 English(EN) · Yan Wang, Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu · 2026-06-09 04:00

FlashMemory-DeepSeek-V4: Lightning-Indexed Ultra-Long Context via Lookahead Sparse Attention

arXiv:2606.09079v1 Announce Type: cross Abstract: Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powere…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 00:00

Dynamic Linear Attention

DLA addresses limitations in long-context LLMs by introducing adaptive state merging and capacity-bounded memory modeling for improved multi-state linear attention.

arXiv cs.AI TIER_1 English(EN) · Lin Huang, Chengxiang Huang, Ziang Wang, Yiyue Du, Chu Wang, Haocheng Lu, Yunyang Li, Xiaoli Liu, Arthur Jiang, Jia Zhang · 2026-06-08 04:00

E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory

arXiv:2601.16622v2 Announce Type: replace-cross Abstract: Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream architectures face critical scalability bottlenecks due to the explicit construction of ge…

arXiv cs.AI TIER_1 English(EN) · Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State · 2026-06-08 04:00

Limitations of Normalization in Attention Mechanisms

arXiv:2508.17821v3 Announce Type: replace-cross Abstract: This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation invo…

arXiv cs.LG TIER_1 English(EN) · Justin Y. Chen, Ying Feng, Piotr Indyk, Michael Kapralov, Ekaterina Kochetkova, Boris Prokhorov · 2026-06-08 04:00

Towards Tight Bounds for Streaming Attention

arXiv:2606.07205v1 Announce Type: cross Abstract: The attention mechanism is a cornerstone of modern transformer architectures. However, its expressive power comes at the cost of quadratic runtime and linear space usage. In particular, the classical transformer architecture expli…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 00:00

FlashMemory-DeepSeek-V4: Lightning-Indexed Ultra-Long Context via Lookahead Sparse Attention

Lookahead Sparse Attention with Neural Memory Indexer reduces GPU memory usage for long-context LLM inference while maintaining accuracy through proactive KV cache management and decoupled training.

arXiv cs.AI TIER_1 English(EN) · Fengfeng Zhou · 2026-06-06 14:21

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that transforms token representations with the real FFT…

arXiv cs.AI TIER_1 English(EN) · Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, Beidi Chen · 2026-06-06 04:00

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

arXiv:2606.06453v1 Announce Type: new Abstract: Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineeri…

arXiv cs.LG TIER_1 English(EN) · Boris Prokhorov · 2026-06-05 12:15

Towards Tight Bounds for Streaming Attention

The attention mechanism is a cornerstone of modern transformer architectures. However, its expressive power comes at the cost of quadratic runtime and linear space usage. In particular, the classical transformer architecture explicitly stores all previously seen input elements (t…

arXiv cs.LG TIER_1 English(EN) · O. Duranthon, F. Boncoraglio, L. Zdeborová · 2026-06-05 04:00

A High-Dimensional Theory of LoRA Fine-Tuning in Solvable Attention Models

arXiv:2606.05899v1 Announce Type: new Abstract: We develop a high-dimensional statistical theory of low-rank adaptation (LoRA) in attention models, capturing the interplay between pre-training and fine-tuning. We introduce a solvable framework in which a single-head attention lay…

arXiv cs.CL TIER_1 English(EN) · Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei · 2026-06-05 04:00

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

arXiv:2606.06467v1 Announce Type: new Abstract: Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face…

arXiv cs.LG TIER_1 English(EN) · Yaobo Zhang · 2026-06-05 04:00

PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention

arXiv:2606.05345v1 Announce Type: new Abstract: We unify RoPE's Fourier phase, Jordan-RoPE's finite jets, and ALiBi's affine recency into a single learnable relative-position space, and study which regions of this space are selected by different tasks. PJ-RoPE is a Fourier-Jet-Af…

arXiv cs.LG TIER_1 English(EN) · M. Sagitova, O. Duranthon, L. Zdeborová · 2026-06-05 04:00

Specialization of Softmax Attention Heads: Insights from a High-Dimensional Single-Location Model

arXiv:2603.03993v2 Announce Type: replace Abstract: Multi-head attention enables transformer models to represent multiple attention patterns simultaneously. Empirically, head specialization emerges in distinct stages during training, while many heads remain redundant and learn si…

arXiv cs.AI TIER_1 English(EN) · Furu Wei · 2026-06-04 17:54

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Struc…

arXiv cs.AI TIER_1 English(EN) · Beidi Chen · 2026-06-04 17:48

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and…

arXiv cs.LG TIER_1 English(EN) · Yilun Kuang, Noah Amsel, Sanae Lotfi, Shikai Qiu, Andres Potapczynski, Andrew Gordon Wilson · 2026-06-04 04:00

Customizing the Inductive Biases of Softmax Attention using Structured Matrices

arXiv:2509.07963v2 Announce Type: replace Abstract: The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it caus…

arXiv cs.LG TIER_1 English(EN) · Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah · 2026-06-04 04:00

Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning

arXiv:2606.04434v1 Announce Type: cross Abstract: Multimodal In-Context Learning (ICL) has emerged as a practical inference paradigm for Multimodal Large Language Models, where a small set of interleaved image-text In-Context Demonstrations (ICDs) conditions the model to solve ne…

arXiv cs.CL TIER_1 English(EN) · Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun · 2026-06-04 04:00

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

arXiv:2511.20102v3 Announce Type: replace Abstract: Sparse attention reduces the quadratic complexity of full self-attention but faces two challenges: (1) an attention gap, where applying sparse attention to full-attention-trained models causes performance degradation due to trai…

arXiv cs.CL TIER_1 English(EN) · Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa · 2026-06-04 04:00

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv:2606.04511v1 Announce Type: new Abstract: Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfe…

arXiv cs.CL TIER_1 English(EN) · Oreste Villa · 2026-06-03 06:42

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itse…

arXiv cs.CL TIER_1 English(EN) · Jiaqian Li, Yanshu Li, Ligong Han, Ruixiang Tang, Wenya Wang · 2026-06-03 04:00

Train Once, Reuse Everywhere: Generalizable Implicit In-Context Learning by Routing Attention

arXiv:2509.22854v2 Announce Type: replace Abstract: Implicit in-context learning (ICL) has newly emerged as a promising paradigm that simulates ICL behaviors in the representation space of large language models (LLMs), aiming to attain few-shot performance at zero-shot cost. Howe…

arXiv cs.AI TIER_1 English(EN) · Xiaojie Xia, Huigang Zhang, Chaoliang Zhong, Jun Sun, Yusuke Oishi · 2026-06-03 04:00

Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction

arXiv:2601.11667v2 Announce Type: replace-cross Abstract: Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms …

arXiv cs.CL TIER_1 English(EN) · Difan Deng, Andreas Bentzen Winje, Lukas Fehring, Marius Lindauer · 2026-06-03 04:00

Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models

arXiv:2602.03681v2 Announce Type: replace Abstract: The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential mod…

arXiv cs.LG TIER_1 English(EN) · Zhibo Yang · 2026-06-03 04:00

Locality Does Not Imply Reachability: Boundary Repair in Block-Sparse Causal Attention

arXiv:2606.02680v1 Announce Type: new Abstract: Sparse causal attention is usually described by sequence locality: nearby tokens should remain easy to access, while distant tokens may be dropped to reduce cost. This paper studies a mismatch between sequence locality and attention…

arXiv cs.AI TIER_1 English(EN) · Younjoo Lee, Seungkyun Dan, Junghoo Lee, Jaiyoung Park, Jung Ho Ahn · 2026-06-02 04:00

DyLLM: Efficient Diffusion LLM Inference via Saliency-Based Token Selection and Partial Attention

arXiv:2603.08026v2 Announce Type: replace-cross Abstract: Masked diffusion language models enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally e…

arXiv cs.AI TIER_1 English(EN) · Guoqiang Zhang · 2026-06-02 04:00

Improved Belief-Attention for Vision Tasks

arXiv:2606.00077v1 Announce Type: cross Abstract: Recently, Belief-Attention \cite{Guoqiang25BeliefAttention} has been proposed by first performing an orthogonal projection of the softmax-based weighted summation of $V$ vectors with respect to the original $V$ vectors and then ta…

arXiv cs.AI TIER_1 English(EN) · Bole Ma, Jan Eitzinger, Harald Köstler, Gerhard Wellein · 2026-06-02 04:00

Move the Query, Not the Cache: Characterizing Inter-Instance Latent Attention Redistribution Across GPU Networks

arXiv:2606.01502v1 Announce Type: cross Abstract: Frontier LLMs increasingly decide what a query attends to with a sparse-attention indexer that picks a few KV-cache blocks per query: attention's unit is now a small, reusable chunk. Agentic workloads hammer it: many sub-agents qu…

arXiv cs.AI TIER_1 English(EN) · Hanze Li, Yaosong Du, Zhibo Yao, Mengyao Zeng, Xiuqi Ge, Xiande Huang · 2026-06-02 04:00

Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

arXiv:2503.06473v5 Announce Type: replace-cross Abstract: Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer…

arXiv cs.AI TIER_1 English(EN) · Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Skhikerujah, Emre Neftci · 2026-06-02 04:00

Learning to Remember, Learn, and Forget in Attention-Based Models

arXiv:2602.09075v3 Announce Type: replace-cross Abstract: In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory…

arXiv cs.CL TIER_1 English(EN) · Dong Le, Thong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu · 2026-06-02 04:00

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

arXiv:2606.01294v1 Announce Type: new Abstract: Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memor…

arXiv cs.CL TIER_1 English(EN) · Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen, Tianchen Huang, Zhenhua An, Zetao Chang, Xiayu Sun, Yuheng Min · 2026-06-02 04:00

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

arXiv:2606.01923v1 Announce Type: new Abstract: Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies…

arXiv cs.AI TIER_1 English(EN) · Soohyeong Shin, Yeongwook Yang · 2026-06-02 04:00

Forget Attention: Importance-Aware Attention Is All You Need

arXiv:2606.02332v1 Announce Type: new Abstract: Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters b…

arXiv cs.AI TIER_1 English(EN) · Yeongwook Yang · 2026-06-01 14:42

Forget Attention: Importance-Aware Attention Is All You Need

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (bl…

arXiv cs.CL TIER_1 English(EN) · Yuheng Min · 2026-06-01 08:57

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron a…

arXiv cs.CL TIER_1 English(EN) · Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai, Bin Yu, Weiran Huang, Kai Wang, Yue Wang · 2026-06-01 04:00

Attend to the Evidence: Evidence-Based Spatial Attention Supervision for Multimodal RLVR

arXiv:2605.30912v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions ju…

arXiv cs.LG TIER_1 English(EN) · Jiefang Xiao, Maolin Gao, Simon Weber, Guandao Yang, Daniel Cremers · 2026-06-01 04:00

Functional Attention: From Pairwise Affinities to Functional Correspondences

arXiv:2605.31559v1 Announce Type: new Abstract: Learning mappings between infinite-dimensional function spaces, or operator learning, is essential for many machine learning applications. Although transformer-based operators are popular, they often rely on token-wise attention. Th…

arXiv cs.LG TIER_1 English(EN) · Wanli Zhong, Haibo Feng, Zirui Zhou, Hanyang Peng, Shiqi Yu · 2026-06-01 04:00

IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

arXiv:2511.21513v2 Announce Type: replace Abstract: Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax-related path as the dominant bottl…

arXiv cs.LG TIER_1 English(EN) · Daniel Cremers · 2026-05-29 17:22

Functional Attention: From Pairwise Affinities to Functional Correspondences

Learning mappings between infinite-dimensional function spaces, or operator learning, is essential for many machine learning applications. Although transformer-based operators are popular, they often rely on token-wise attention. These methods treat continuous fields as discrete …

arXiv cs.CL TIER_1 English(EN) · Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho · 2026-05-29 04:00

Long-Context Modeling with Dynamic Hierarchical Sparse Attention for Memory-Constrained LLM Inference

arXiv:2510.24606v2 Announce Type: replace Abstract: The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-depend…

arXiv cs.LG TIER_1 English(EN) · Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis · 2026-05-29 04:00

DualKV: Shared-Prompt FlashAttention for Efficient RL Training with Large Rollouts and Long Contexts

arXiv:2605.15422v2 Announce Type: replace Abstract: Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both f…
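
The replication cost described here comes down to simple counting. The sketch below only tallies tokens under the abstract's notation ($N$ responses, $P$ prompt tokens, $R$ response tokens); it says nothing about how DualKV's kernels work, and the function names and example sizes are illustrative assumptions.

```python
def replicated_tokens(n, p, r):
    """Tokens processed when each of n sequences carries its own prompt copy."""
    return n * (p + r)

def shared_prompt_tokens(n, p, r):
    """Tokens processed if the p prompt tokens are kept once and shared."""
    return p + n * r

# Hypothetical sizes: 8 responses of 512 tokens from a 4096-token prompt.
n, p, r = 8, 4096, 512
replicated = replicated_tokens(n, p, r)
shared = shared_prompt_tokens(n, p, r)
ratio = replicated / shared
```

With these made-up sizes the replicated layout touches 36864 tokens against 8192 for a shared prompt, a factor of 4.5. The gap widens as the prompt grows relative to the responses.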

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

Functional Attention: From Pairwise Affinities to Functional Correspondences

Functional Attention reinterprets attention as functional correspondence between adaptive bases, enabling compact and resolution-invariant operator learning for PDE solving and 3D segmentation.

arXiv cs.AI TIER_1 English(EN) · Gabriel Franco, Carson Loughridge, Mark Crovella · 2026-05-28 04:00

Singular Vectors of Attention Heads Align with Features

arXiv:2602.13524v2 Announce Type: replace-cross Abstract: Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made the observation that feature representations can be inferred in some cases from sin…

arXiv cs.AI TIER_1 English(EN) · Ziyue Zhao, Qining Qi, Jianfa Ma · 2026-05-28 04:00

Manboformer: Learning Gaussian Representations via Spatial-Temporal Attention Mechanism

arXiv:2503.04863v2 Announce Type: replace-cross Abstract: Compared with voxel-based grid prediction, in the field of 3D semantic occupation prediction for autonomous driving, GaussianFormer proposed using 3D Gaussian to describe scenes with sparse 3D semantic Gaussian based on ob…

arXiv cs.CL TIER_1 English(EN) · Keqi Deng, Shaoshi Ling, Ruchao Fan, Jinyu Li · 2026-05-28 04:00

UNIQUE: Universal Top-k Sparse Attention for Training-Free Inference and Sparsity-Aware Training

arXiv:2605.27740v1 Announce Type: new Abstract: Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but acc…

arXiv cs.AI TIER_1 English(EN) · Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare, Sudarshan Srinivasan, Suvinay Subramanian, Souvik Kundu, Madhu Kumar, Midhilesh Elavazhagan, William Won, Amir Yazdanbakhsh, Tushar Krishna · 2026-05-28 04:00

How Far Can Disaggregation Go? A Design Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

arXiv:2605.28302v1 Announce Type: cross Abstract: Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggre…

arXiv cs.LG TIER_1 English(EN) · Adam Zweiger, Xinghong Fu, Han Guo, Yoon Kim · 2026-05-28 04:00

Fast KV Compaction via Attention Matching

arXiv:2602.16284v2 Announce Type: replace Abstract: Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summ…

arXiv cs.LG TIER_1 English(EN) · Xiuying Wei, Caglar Gulcehre · 2026-05-28 04:00

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

arXiv:2605.28640v1 Announce Type: new Abstract: Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilate…

arXiv cs.LG TIER_1 English(EN) · Caglar Gulcehre · 2026-05-27 15:46

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we…

arXiv cs.LG TIER_1 English(EN) · Pedro Henrique da Costa Avelar, Anderson R. Tavares, Luís C. Lamb · 2026-05-27 04:00

Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

arXiv:2605.27144v1 Announce Type: cross Abstract: Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduc…

arXiv cs.CL TIER_1 English(EN) · Anay Chauhan, Gurucharan Marthi Krishna Kumar, Arion Das, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das · 2026-05-27 04:00

SPHERICAL KV: Angular-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

arXiv:2605.18856v2 Announce Type: replace-cross Abstract: Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing…

arXiv cs.AI TIER_1 English(EN) · Dongyun Zou, Zhuoyang Zhang, Junyu Chen, Wenkun He, Qinhe Peng, Hanrong Ye, Yao Lu, Hongxu Yin, Yu Wang, Song Han, Han Cai · 2026-05-27 04:00

JetViT: An Efficient High-Resolution Transformer with Post-Training Attention Search

arXiv:2605.26636v1 Announce Type: cross Abstract: We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficien…

arXiv cs.AI TIER_1 English(EN) · Yujie Chen, Tailai Chen, Yifeng Gao, Zoe Wanying He, Yijue Xu, Shaobo Wang, Linfeng Zhang · 2026-05-27 04:00

Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

arXiv:2604.18103v2 Announce Type: replace Abstract: Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuris…

arXiv cs.CL TIER_1 English(EN) · Athanasios Zeris · 2026-05-27 04:00

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

arXiv:2605.26355v1 Announce Type: cross Abstract: Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary i…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 00:00

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets.

arXiv cs.LG TIER_1 English(EN) · Luís C. Lamb · 2026-05-26 15:09

Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduced new paradigms in self-attentional models, surpa…

arXiv cs.LG TIER_1 English(EN) · Weikang Meng, Yadan Luo, Liangyu Huo, Yingjian Li, Yaowei Wang, Xin Li, Zheng Zhang · 2026-05-26 04:00

Norm$\times$Direction: Restoring the Missing Query Norm in Vision Linear Attention

arXiv:2506.21137v3 Announce Type: replace Abstract: Linear attention mitigates the quadratic complexity of softmax attention but suffers from a critical loss of expressiveness. We identify two primary causes: (1) The normalization operation cancels the query norm, which breaks th…

arXiv cs.AI TIER_1 English(EN) · Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu · 2026-05-26 04:00

Prism: Spectral-Aware Block-Sparse Attention

arXiv:2602.08426v2 Announce Type: replace-cross Abstract: Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for…

arXiv cs.AI TIER_1 English(EN) · Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf · 2026-05-26 04:00

Inherently Interpretable Attention via Sparse Post-Training

arXiv:2512.05865v5 Announce Type: replace-cross Abstract: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B…

arXiv cs.AI TIER_1 English(EN) · Spandan Pratyush · 2026-05-26 04:00

Syntax-Guided Sparse Attention for Efficient and Interpretable Transformers

arXiv:2605.24518v1 Announce Type: cross Abstract: The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. For this approach, there has been significant researc…

arXiv cs.AI TIER_1 Deutsch(DE) · Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica · 2026-05-26 04:00

vAttention: Verified Sparse Attention

arXiv:2510.05688v2 Announce Type: replace-cross Abstract: State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these appr…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-25 22:04

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: ene…

arXiv cs.AI TIER_1 English(EN) · Xinyu Chen, Yuyi Qian, Jiang Lin, Shenyi Wang, Gao Wang, Zhiqiu Zhang, Jizhi Zhang, Mingjie Wang, Qiang Tang, Qian Wang, Song Wu, Zili Yi · 2026-05-25 04:00

SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

arXiv:2605.23245v1 Announce Type: cross Abstract: Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering…

arXiv cs.AI TIER_1 English(EN) · Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu · 2026-05-25 04:00

Sparser Block-Sparse Attention via Token Permutation

arXiv:2510.21270v2 Announce Type: replace-cross Abstract: Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respec…

arXiv cs.AI TIER_1 English(EN) · Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim · 2026-05-22 04:00

Retrospective Sparse Attention for Efficient Long-Context Generation

arXiv:2508.09001v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cach…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-16 00:00

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

RTPurbo leverages intrinsic sparsity in full-attention LLMs to achieve efficient long-context inference with minimal training overhead, enabling significant speedups while maintaining near-lossless accuracy.

arXiv cs.CV TIER_1 English(EN) · Tsz Lok Ip, Han Zhang, Lok Ming Lui · 2026-06-12 04:00

Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

arXiv:2606.12869v1 Announce Type: new Abstract: In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly i…

arXiv cs.CV TIER_1 English(EN) · Lok Ming Lui · 2026-06-11 03:56

Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly in practice and are often concentrated in localiz…

arXiv cs.CV TIER_1 Deutsch(DE) · Sankeerth Durvasula, Kavya Sreedhar, Zain Moustafa, Suraj Kothawade, Tianlei Pang, Ashish Gondimalla, Suvinay Subramanian, Narges Shahidi, Nandita Vijaykumar · 2026-06-10 04:00

FG-Attn: Leveraging Fine-Grained Sparse Attention in Video Diffusion Models

arXiv:2509.16518v2 Announce Type: replace Abstract: Using diffusion transformers for media generation may require evaluating attention over extremely long sequences, with attention layers accounting for the majority of generation latency. Exploiting sparsity in attention maps off…

arXiv cs.CV TIER_1 English(EN) · Jie Ma, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji · 2026-06-09 04:00

See Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal Large Language Models

arXiv:2606.08511v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computational cost of self-attention over long visual token sequences. However, we identify a critical inefficiency in current arc…

arXiv stat.ML TIER_1 English(EN) · Kabir Murjani · 2026-06-05 04:00

Zero-Copy Semantic Propagation: A Memory-Streaming Architecture for Evolving Attention Graphs

arXiv:2606.05733v1 Announce Type: cross Abstract: Per-ticker forecasting models dominate financial time-series work yet remain blind to cross-company propagation: a foundry disruption in Taiwan does not register in a single-asset model until Apple's own price has already moved. T…

arXiv stat.ML TIER_1 English(EN) · Kabir Murjani · 2026-06-04 05:48

Zero-Copy Semantic Propagation: A Memory-Streaming Architecture for Evolving Attention Graphs

Per-ticker forecasting models dominate financial time-series work yet remain blind to cross-company propagation: a foundry disruption in Taiwan does not register in a single-asset model until Apple's own price has already moved. To address this limitation, we introduce a heteroge…

arXiv cs.CV TIER_1 English(EN) · Fatemeh Afghah · 2026-06-03 04:32

Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning

Multimodal In-Context Learning (ICL) has emerged as a practical inference paradigm for Multimodal Large Language Models, where a small set of interleaved image-text In-Context Demonstrations (ICDs) conditions the model to solve new tasks. Despite its flexibility, multimodal ICL i…

arXiv cs.CV TIER_1 English(EN) · Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao · 2026-06-03 04:00

Attend to Anything: Foundation Model for Unified Human Attention Modeling

arXiv:2606.03540v1 Announce Type: new Abstract: Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remai…

X — Together (inference / OSS) TIER_1 English(EN) · togethercompute · 2026-06-02 19:42

RT @ywangfirstlean: First technical Deepdive on M3 on the internet😎

RT @ywangfirstlean: First technical Deepdive on M3 on the internet😎

X — Together (inference / OSS) TIER_1 English(EN) · togethercompute · 2026-06-02 19:38

MiniMax-M3 combines 1M context, native multimodality, and MiniMax Sparse Attention.

MiniMax-M3 combines 1M context, native multimodality, and MiniMax Sparse Attention. The next layer is serving it efficiently: KV-block-major sparse attention, paged MSA decode, optimized index scoring, and multimodal preprocessing before the GPU worker. Together’s Inference and

arXiv cs.CV TIER_1 English(EN) · Qijun Zhao · 2026-06-02 12:00

Attend to Anything: Foundation Model for Unified Human Attention Modeling

Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to …

arXiv stat.ML TIER_1 English(EN) · Chungpa Lee, Jy-yong Sohn, Kangwook Lee · 2026-06-02 04:00

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

arXiv:2602.23197v2 Announce Type: replace-cross Abstract: Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot pe…

arXiv stat.ML TIER_1 English(EN) · Tobias Schr\"oder, Lester Mackey · 2026-06-02 04:00

WildCat: Near-Linear Attention in Theory and Practice

arXiv:2602.10056v2 Announce Type: replace-cross Abstract: We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy du…

arXiv cs.CV TIER_1 English(EN) · David Hagerman, Roman Naeem, Jakob Lindqvist, Carl Lindstr\"om, Fredrik Kahl, Lennart Svensson · 2026-05-29 04:00

SwInception -- Local Attention Meets Convolutions

arXiv:2605.29954v1 Announce Type: new Abstract: Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and yields excellent performance f…

arXiv cs.CV TIER_1 English(EN) · Krishna Kumar Sharma, Somdyuti Paul · 2026-05-29 04:00

Accelerating HEVC Intra Partitioning with a Hybrid CNN-Hierarchical Attention Transformer

arXiv:2605.29063v1 Announce Type: cross Abstract: The recursive quad-tree partitioning in High Efficiency Video Coding (HEVC) incurs considerable computational overhead, with exhaustive rate-distortion optimization for CTU partition prediction consuming the dominant share of enco…

arXiv cs.CV TIER_1 English(EN) · Lennart Svensson · 2026-05-28 14:00

SwInception -- Local Attention Meets Convolutions

Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and yields excellent performance for many tasks but still tends to overfit on smal…

arXiv cs.CV TIER_1 English(EN) · Xi Leng, Xinhong Ma, Ziqiang Dong, Feng Zhang, Xiaoying Tang, Yang Yang, Guanjun Jiang · 2026-05-27 04:00

RAVE: Reallocating Visual Attention in Large Multimodal Models

arXiv:2605.18359v2 Announce Type: replace Abstract: Large multimodal models (LMMs) inherit the self-attention mechanism of pretrained language backbones, yet standard attention can exhibit suboptimal allocation, including cross-modal misallocation between textual and visual evide…

arXiv cs.CV TIER_1 English(EN) · Han Cai · 2026-05-26 07:17

JetViT: An Efficient High-Resolution Transformer with Post-Training Attention Search

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our a…

arXiv cs.CV TIER_1 English(EN) · Rishabh Sabharwal, Ram Samarth B B, Parikshit Singh Rathore, Punit Rathore · 2026-05-26 04:00

STEAM: Squeeze and Transform Enhanced Attention Module

arXiv:2412.09023v2 Announce Type: replace Abstract: Channel and spatial attention mechanisms introduced by earlier works enhance the representation abilities of deep convolutional neural networks (CNNs) but often lead to increased parameter and computation costs. While recent app…

arXiv cs.CV TIER_1 English(EN) · Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, Wenya Wang · 2026-05-26 04:00

Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

arXiv:2602.04789v2 Announce Type: replace Abstract: Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attenti…

arXiv cs.CV TIER_1 English(EN) · Jie Hu, Zixiang Gao, Yutong He, Kun Yuan · 2026-05-25 04:00

DFSAttn: Dynamic Fine-Grained Sparse Attention for Efficient Video Generation

arXiv:2605.23445v1 Announce Type: new Abstract: Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Blo…

arXiv cs.CV TIER_1 English(EN) · Kun Yuan · 2026-05-22 09:58

DFSAttn: Dynamic Fine-Grained Sparse Attention for Efficient Video Generation

Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to miti…

arXiv cs.CV TIER_1 English(EN) · Zili Yi · 2026-05-22 05:28

SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering or resource-intensive retraining, restricting the…

X — MiniMax AI TIER_1 English(EN) · MiniMax_AI · 2026-05-26 23:08

RT @eliebakouch: new minimax sparse attention compared to deepseek v3.2 (DSA) and v4 (CSA)

RT @eliebakouch: new minimax sparse attention compared to deepseek v3.2 (DSA) and v4 (CSA) main changes: - based on GQA not MLA - block le…

MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-06-01 04:36

Parallax: A Parameterized Local Linear Attention That Keeps Softmax and Adds a Learned Covariance Correction Branch

Parallax replaces LLA's per-query solver with a learned projector, doubling arithmetic intensity and improving perplexity at 0.6B and 1.7B.

Medium — MLOps tag TIER_1 English(EN) · The_Turingetic_Guy · 2026-05-31 19:06

Large-Scale Distributed LLM Inference, Part 2: Modern Attention Mechanisms That Make Large-Scale…

https://medium.com/@the_turingetic_guy/large-scale-distributed-llm-inference-part-2-modern-attention-mechanisms-that-make-large-scale-09ba8d6581e1?source=rss------mlops-5

dev.to — LLM tag TIER_1 English(EN) · pueding · 2026-06-12 11:29

MiniMax M3 Ships Open-Weight 1M Context: MiniMax Sparse Attention (MSA)

What: The MiniMax M3 release (an open-weight model with a 1M-token context and 59% on SWE-Bench Pro) is built on MiniMax Sparse Attention (MSA), a block-sparse attention that gather…

r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-06-10 16:30

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

https://www.reddit.com/r/LocalLLaMA/comments/1u277fg/flashmemorydeepseekv4_lightning_index_ultralong/

dev.to — LLM tag TIER_1 English(EN) · Tech_Nuggets · 2026-06-10 11:20

Flash Attention: what it does and why it matters

Your training job is paying for an A100 at $3/hour. The loss is going down, gradients are flowing, and the model's loss curve looks textbook-logarithmic. But if you profile the step time and look at what the GPU is ac…

dev.to — LLM tag TIER_1 English(EN) · Tech_Nuggets · 2026-06-10 09:58

Flash Attention: what it does and why it matters

You have a single H100 with 80 GB of VRAM. The Llama 3.1 70B model fits, barely, at 140 GB in FP16, so you're running at 4-bit quantization and have maybe 5–8 GB of KV cache space left for a long-context workload. Th…

r/LocalLLaMA TIER_1 English(EN) · /u/incarnadine72 · 2026-06-03 21:35

Inference optimization for MiniMax Sparse Attention

https://www.reddit.com/r/LocalLLaMA/comments/1tw3hhj/inference_optimization_for_minimax_sparse/

dev.to — LLM tag TIER_1 English(EN) · Atlas Cloud · 2026-05-29 05:54

MiniMax Goes Sparse: Reading M3's Attention Mechanism from a Single Diagram

On May 26, MiniMax R&D lead Skyler Miao posted a diagram on X: restrained palette, but a lot of information packed in. The title reads MiniMax Sparse Attention, and the two curves on the right give an eye-catching pair of numbers: 9.7× prefill and 15.6× d…

r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-05-25 15:03

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesira…

Mastodon — mastodon.social TIER_1 English(EN) · [email protected] · 2026-06-01 04:52

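Parallax introduces a parameterised local linear attention mechanism that preserves softmax while incorporating a learned covariance correction branch. Developed…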

Parallax introduces a parameterised local linear attention mechanism that preserves softmax while incorporating a learned covariance correction branch. Developed by researchers from Northwestern, Tilde Research and UW, the approach achieves up to 1.54× speedup over FlashAttention…

Coverage sources [115]
