PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

PersistentKV optimizes LLM serving on commodity GPUs with a new scheduling technique

By the PulseAugur editorial team · [4 sources] · 2026-06-25 06:56

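A new paper introduces PersistentKV, a system designed to optimize long-context large language model (LLM) serving on commodity GPUs. PersistentKV uses page-aware decode scheduling and a native block-table attention engine to reduce KV-cache fragmentation and improve throughput. Compared with existing approaches such as FlashInfer, the system shows up to a 1.4x performance improvement on some workloads, and it identifies work distribution as a key factor in LLM serving efficiency.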
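To make the terms in the summary concrete, here is a minimal Python sketch of what a block table and page-level work distribution can look like in a paged KV cache. It is an illustration under stated assumptions, not PersistentKV's actual scheduler: the page size of 16 tokens, the sequential page allocation, and the names `Sequence`, `build_block_table`, and `partition_pages` are all hypothetical.

```python
from dataclasses import dataclass

PAGE_SIZE = 16  # tokens per KV page (hypothetical value, chosen for illustration)


@dataclass
class Sequence:
    seq_id: int
    num_tokens: int


def build_block_table(seqs, page_size=PAGE_SIZE):
    """Map each sequence id to the list of physical page ids holding its KV cache.

    Pages are handed out sequentially here; a real allocator would reuse freed pages.
    """
    table = {}
    next_page = 0
    for s in seqs:
        n_pages = -(-s.num_tokens // page_size)  # ceiling division
        table[s.seq_id] = list(range(next_page, next_page + n_pages))
        next_page += n_pages
    return table


def partition_pages(block_table, num_workers):
    """Split the flat list of (seq_id, page_id) work items into near-equal chunks.

    Balancing by page count rather than by sequence keeps one long sequence
    from occupying a single worker while the others sit idle.
    """
    items = [(sid, p) for sid, pages in block_table.items() for p in pages]
    base, extra = divmod(len(items), num_workers)
    chunks, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


seqs = [Sequence(0, 100), Sequence(1, 10), Sequence(2, 33)]
table = build_block_table(seqs)
chunks = partition_pages(table, num_workers=4)
print([len(table[s.seq_id]) for s in seqs])  # pages per sequence: [7, 1, 3]
print([len(c) for c in chunks])              # pages per worker:   [3, 3, 3, 2]
```

With one worker per sequence, the busiest worker would process 7 pages; splitting by page caps it at 3. When a sequence's pages land on several workers, their partial attention results have to be merged afterwards, a step this sketch leaves out.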

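Impact: This research could lead to more efficient and more cost-effective deployment of long-context LLMs on widely available hardware.

Ranking rationale: This cluster concerns a research paper detailing a new LLM serving system, not a product launch or a major industry event.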

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

arXiv cs.LG TIER_1 English(EN) · Muhammad Ahmed · 2026-06-26 04:00

PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

arXiv:2606.26666v1 Announce Type: new Abstract: Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels such…
arXiv cs.LG TIER_1 English(EN) · Muhammad Ahmed · 2026-06-25 06:56

PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels such as FlashInfer provide highly optimized native-p…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-25 06:56

PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels such as FlashInfer provide highly optimized native-p…
dev.to — LLM tag TIER_1 English(EN) · soy · 2026-06-26 21:33

GPU Overclocking for Local LLMs, Document Transformation, & Lightweight Agentic Apps

Today's Highlights: This week's top stories highlight practical tools for boosting local LLM performance, preparing complex documents for agentic workflows, and buildi…
