New Architectures and Frameworks Target Long-Context LLM Serving Bottlenecks

By PulseAugur Editorial Team · [5 sources] · 2026-04-23 20:12

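Researchers have developed new architectures and techniques to address the growing latency and energy costs of serving large language models (LLMs) with long contexts. One approach, AMMA, proposes a memory-centric multi-chiplet design that replaces GPU compute dies with HBM-PNM cubes to raise memory bandwidth, and reports significant reductions in latency and energy consumption compared with the NVIDIA H100. Another framework, SPIN, combines sparse attention algorithms with hierarchical KV storage, improving throughput and reducing time to first token by optimizing KV cache management across GPU and CPU memory. In addition, LayerBoost offers a layer-aware attention reduction method that selectively modifies the attention mechanism within Transformer layers, improving efficiency by up to 68% while preserving model quality.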

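Impact: The new architectures and techniques are expected to significantly reduce the latency and energy costs of LLM serving, enabling more efficient long-context processing.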

Ranking rationale: Multiple academic papers propose new architectures and techniques for efficient LLM serving.


AI-generated summary · Google Gemini · from 5 sources.

Sources [5]

arXiv cs.AI TIER_1 English(EN) · Zhongkai Yu, Haotian Ye, Chenyang Zhou, Ohm Rishabh Venkatachalam, Zaifeng Pan, Zhengding Hu, Junsung Kim, Won Woo Ro, Po-An Tsai, Shuyi Pei, Yangwook Kang, Yufei Ding · 2026-04-30 04:00

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M-Context Attention Serving

arXiv:2604.26103v1. Abstract: All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central h…

arXiv cs.LG TIER_1 English(EN) · Zihan Zhao, Baotong Lu, Shengjie Lin, Yizou Chen, Jing Liu, Yanqi Zhang, Ziming Miao, Ming-Chang Yang, Haiying Shen, Qi Chen, Fan Yang · 2026-04-30 04:00

Unifying Sparse Attention and Hierarchical Memory for Scalable Long-Context LLM Serving

arXiv:2604.26837v1. Abstract: Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extendin…

arXiv cs.LG TIER_1 English(EN) · Fan Yang · 2026-04-29 16:02

Unifying Sparse Attention and Hierarchical Memory for Scalable Long-Context LLM Serving

Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, how…
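
The idea in this abstract, attending to only a small, query-dependent subset of the KV state while keeping the rest in CPU memory, can be illustrated with a toy sketch. This is not the paper's algorithm: `TieredKVCache`, `decode_step`, the LRU eviction policy, and the exact top-k selection are all hypothetical choices made for illustration, and scoring here reads every key directly, a simplification that a real system would presumably avoid.

```python
import math
from collections import OrderedDict


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class TieredKVCache:
    """Hypothetical two-tier store: a small LRU 'gpu' tier over a full 'cpu' tier."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # position -> (key, value), hot entries only
        self.cpu = {}             # position -> (key, value), every entry
        self.cpu_fetches = 0      # counts transfers from the slow tier

    def append(self, pos, key, value):
        self.cpu[pos] = (key, value)

    def fetch(self, pos):
        if pos in self.gpu:
            self.gpu.move_to_end(pos)  # mark as most recently used
            return self.gpu[pos]
        self.cpu_fetches += 1
        entry = self.cpu[pos]
        self.gpu[pos] = entry
        if len(self.gpu) > self.gpu_capacity:
            self.gpu.popitem(last=False)  # evict least recently used
        return entry


def decode_step(query, cache, top_k):
    """One decoding step that attends to only the top_k highest-scoring positions."""
    positions = sorted(cache.cpu)
    scale = 1.0 / math.sqrt(len(query))
    # Simplification: exact scores over all keys, read straight from the slow tier.
    scores = {p: dot(query, cache.cpu[p][0]) * scale for p in positions}
    selected = sorted(positions, key=lambda p: scores[p], reverse=True)[:top_k]
    entries = [cache.fetch(p) for p in selected]  # only these move to the fast tier
    weights = softmax([scores[p] for p in selected])
    dim = len(entries[0][1])
    out = [sum(w * v[d] for w, (_, v) in zip(weights, entries)) for d in range(dim)]
    return out, sorted(selected)
```

In this sketch, repeating a query whose selected positions are already resident in the fast tier triggers no further slow-tier fetches, which is the effect a tiered KV store aims for.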

arXiv cs.CL TIER_1 English(EN) · Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric · 2026-04-27 04:00

LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

arXiv:2604.22050v1. Abstract: Transformers are mostly relying on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically…

arXiv cs.CL TIER_1 English(EN) · Igor Peric · 2026-04-23 20:12

LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

Transformers are mostly relying on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all l…
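
To make the contrast in this abstract concrete, the sketch below places standard softmax attention next to one common form of linear attention. It is an illustration only and does not reproduce LayerBoost: the `elu(x) + 1` feature map, the non-causal formulation, and the function names are assumptions chosen for brevity. Softmax attention compares every query with every key, so its cost grows quadratically with sequence length; the linear variant summarizes all keys and values once and reuses that summary for each query.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_attention(Q, K, V):
    """Standard attention: each query scores every key (quadratic in length)."""
    scale = 1.0 / math.sqrt(len(K[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        scores = [dot(q, k) * scale for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wi * v[d] for wi, v in zip(w, V)) / total for d in range(dv)])
    return out


def phi(x):
    """Assumed positive feature map: elu(x) + 1."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Non-causal linear attention: keys and values are summarized once."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]  # running sum of phi(k) v^T
    z = [0.0] * dk                       # running sum of phi(k)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)  # strictly positive because phi is positive
        out.append([sum(fq[a] * S[a][b] for a in range(dk)) / denom for b in range(dv)])
    return out
```

Both functions return a weighted average of the value vectors for each query; they differ in how the weights are computed and in how the work scales with sequence length.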
