研究人员开发了几种新方法来提高AI模型中注意力机制的效率。一种名为SimInsert的方法,通过将单帧编辑与时间传播解耦,专注于无缝视频对象插入。另一组技术,包括PBS-Attn和RetroAttention,旨在通过降低计算复杂性和提高KV缓存效率来优化处理长上下文的大型语言模型(LLMs)的注意力。此外,DFSAttn和RTPurbo提供了实现稀疏注意力的创新方法,无论是通过视频生成的动态细粒度稀疏化,还是通过最少量的训练将现有的全注意力模型转换为稀疏模型。
AI
arXiv:2510.05688v2 Announce Type: replace-cross Abstract: State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these appr…
arXiv:2506.21137v3 Announce Type: replace Abstract: Linear attention mitigates the quadratic complexity of softmax attention but suffers from a critical loss of expressiveness. We identify two primary causes: (1) The normalization operation cancels the query norm, which breaks th…
arXiv cs.AI
TIER_1English(EN)·Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu·
arXiv:2512.05865v5 Announce Type: replace-cross Abstract: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B…
arXiv:2605.24518v1 Announce Type: cross Abstract: The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. For this approach, there has been significant researc…
arXiv:2510.21270v2 Announce Type: replace-cross Abstract: Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respec…
arXiv:2605.23245v1 Announce Type: cross Abstract: Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering…
arXiv cs.AI
TIER_1English(EN)·Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim·
arXiv:2508.09001v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cach…
RTPurbo leverages intrinsic sparsity in full-attention LLMs to achieve efficient long-context inference with minimal training overhead, enabling significant speedups while maintaining near-lossless accuracy.
arXiv:2602.04789v2 Announce Type: replace Abstract: Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attenti…
arXiv:2412.09023v2 Announce Type: replace Abstract: Channel and spatial attention mechanisms introduced by earlier works enhance the representation abilities of deep convolutional neural networks (CNNs) but often lead to increased parameter and computation costs. While recent app…
arXiv cs.CV
TIER_1English(EN)·Jie Hu, Zixiang Gao, Yutong He, Kun Yuan·
arXiv:2605.23445v1 Announce Type: new Abstract: Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Blo…
Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block sparse attention is a common approach to miti…
Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering or resource-intensive retraining, restricting the…
<!-- SC_OFF --><div class="md"><blockquote> <p>Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesira…