PulseAugur
PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

PersistentKV Optimizes LLM Serving on Commodity GPUs with a New Scheduling Technique

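A new paper introduces PersistentKV, a system designed to optimize long-context large language model (LLM) serving on commodity GPUs. PersistentKV uses page-aware decode scheduling and a native block-table attention engine to reduce KV-cache fragmentation and improve throughput. Compared with existing approaches such as FlashInfer, the system shows up to a 1.4x performance improvement on some workloads, and it identifies work distribution as a key factor in LLM serving efficiency.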
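To make the terms "block table" and "page-aware" concrete, here is a minimal Python sketch of a paged KV cache. It is not PersistentKV's implementation. The class `PagedKVCache`, the function `decode_work_items`, and the value of `PAGE_SIZE` are hypothetical names and numbers chosen for illustration. The sketch only shows the general idea that a block table maps each sequence's logical pages to physical pages, and that decode work can be enumerated per page rather than per sequence.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 16  # tokens per KV page (illustrative assumption)


@dataclass
class PagedKVCache:
    """Toy paged KV cache: tracks page ownership only, not tensor data."""
    num_pages: int
    free_pages: list = field(init=False, default_factory=list)
    block_tables: dict = field(init=False, default_factory=dict)
    seq_lens: dict = field(init=False, default_factory=dict)

    def __post_init__(self):
        # Stack of free physical page ids; page 0 is handed out first.
        self.free_pages = list(range(self.num_pages - 1, -1, -1))

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_page, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % PAGE_SIZE == 0:  # current page is full, or first token
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // PAGE_SIZE], length % PAGE_SIZE

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


def decode_work_items(cache):
    """List one (seq_id, physical_page, valid_tokens) item per occupied page.

    Enumerating work per page instead of per sequence lets a scheduler
    spread one long sequence across several workers.
    """
    items = []
    for seq_id, table in cache.block_tables.items():
        remaining = cache.seq_lens[seq_id]
        for page in table:
            valid = min(PAGE_SIZE, remaining)
            items.append((seq_id, page, valid))
            remaining -= valid
    return items
```

In this sketch a 20-token sequence occupies two pages and yields two work items, while a 1-token sequence yields one, so the amount of scheduled work follows the number of occupied pages rather than the number of sequences.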

Impact: This research could lead to more efficient and more cost-effective deployment of long-context LLMs on widely available hardware.

Ranking rationale: This cluster covers a research paper detailing a new LLM serving system, not a product launch or a major industry event.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. arXiv cs.LG TIER_1 English(EN) · Muhammad Ahmed ·

    PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

    arXiv:2606.26666v1 Announce Type: new Abstract: Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels such…

  2. arXiv cs.LG TIER_1 English(EN) · Muhammad Ahmed ·

    PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

    Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels such as FlashInfer provide highly optimized native-p…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

    Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels such as FlashInfer provide highly optimized native-p…

  4. dev.to — LLM tag TIER_1 English(EN) · soy ·

    GPU Overclocking for Local LLMs, Document Transformation, & Lightweight Agentic Apps

    Today's Highlights: This week's top stories highlight practical tools for boosting local LLM performance, preparing complex documents for agentic workflows, and buildi…