RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

New Systems Tackle the Long-Context Bottleneck in LLM Serving

By the PulseAugur editorial team · [4 sources] · 2026-06-04 14:57

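Two new research papers introduce approaches to managing long contexts when serving large language models, where the KV cache is a key bottleneck. RedKnot proposes a head-aware KV cache management system that decomposes the cache according to each attention head's attention pattern and effective range, improving resource efficiency and scalability. TokenMizer models conversation history as a graph-structured knowledge graph and, by preserving relational structure, achieves significant token savings and higher decision recall.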
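To make the bottleneck concrete, the sketch below estimates the KV cache footprint of a single sequence using the standard size formula for a transformer decoder (keys plus values, per layer, per KV head, per token). The model dimensions are hypothetical and are not taken from either paper, and the code does not implement RedKnot or TokenMizer; it only illustrates why memory grows linearly with input length.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model configuration (assumed for illustration only), fp16 cache.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB per sequence")
```

Under these assumed dimensions, each token costs 128 KiB of cache, so a 16x longer input needs 16x more GPU memory per request, which in turn reduces how many requests can be served concurrently.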

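Impact: These systems aim to improve the efficiency and scalability of LLM serving, potentially supporting more complex, longer-context applications.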

Ranking rationale: Two academic papers propose new approaches to LLM infrastructure.


AI-generated summary · Google Gemini · based on 4 sources.

Sources [4]

arXiv cs.AI TIER_1 English(EN) · Yang Liu, ZhaoKai Luo, HuaYi Jin, ZhiYong Wang, RuoZhou He, BoYu Wang, Guanjie Chen, Junhao Hu · 2026-06-06 04:00

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

arXiv:2606.06256v1 Announce Type: new Abstract: As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalab…
arXiv cs.AI TIER_1 English(EN) · Shweta Mishra · 2026-06-06 04:00

TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management

arXiv:2606.06337v1 Announce Type: new Abstract: Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critic…
arXiv cs.AI TIER_1 English(EN) · Shweta Mishra · 2026-06-04 16:12

TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management

Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisi…
arXiv cs.AI TIER_1 English(EN) · Junhao Hu · 2026-06-04 14:57

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including pos…
