English(EN) KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

New research tackles the LLM KV cache bottleneck with advanced compression and storage techniques

By the PulseAugur editorial team · [21 sources] · 2026-03-04 00:00

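Several research papers published in May 2026 introduce new techniques for optimizing the key-value (KV) cache of large language models to address memory and latency bottlenecks. The approaches include offloading the KV cache to object storage such as S3 (ObjectCache), advanced compression strategies such as three-way token routing (VECTOR), and selective KV cache recomputation with an auxiliary model (CacheClip). Other methods focus on hardware-aware quantization (InnerQ, OCTOPUS) and service-aware adaptive compression (KVServe) to improve efficiency and reduce decoding latency, particularly for long-context inference and retrieval-augmented generation (RAG) systems.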

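Impact: These advances in KV cache optimization could significantly improve the efficiency and speed of long-context LLM inference, making advanced AI applications more practical and cost-effective.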

Ranking rationale: Multiple research papers published on arXiv that detail new methods for optimizing the LLM KV cache.
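To make the memory bottleneck in the summary concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with context length. The model shape (32 layers, 8 KV heads, head dimension 128, in the style of Llama-3.1-8B) is an assumption chosen for illustration and does not come from any of the listed papers; the 2-bit figure ignores quantization metadata such as scales and zero-points, so real systems will land somewhat higher.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2.0):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers the two cached tensors (K and V).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed Llama-3.1-8B-style shape (illustrative, not taken from the sources).
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 32_768, 131_072):
    # BF16 stores 2 bytes per element.
    bf16 = kv_cache_bytes(seq_len=seq_len, **SHAPE)
    # 2-bit storage is 0.25 bytes per element, ignoring scale/zero-point overhead.
    int2 = kv_cache_bytes(seq_len=seq_len, bytes_per_elem=0.25, **SHAPE)
    print(f"{seq_len:>7} tokens: BF16 {bf16 / 2**30:.2f} GiB, 2-bit {int2 / 2**30:.3f} GiB")
```

Under these assumptions the BF16 cache costs 128 KiB per token, which is 1 GiB for a single 8K-token sequence and grows linearly from there. That linear growth is the pressure that eviction, quantization, and offloading approaches each try to relieve in different ways.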

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 21 sources. How we write summaries →

Sources [21]

arXiv cs.AI TIER_1 English(EN) · Yubo Li, Yidi Miao · 2026-05-26 04:00

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

arXiv:2605.24786v1 Announce Type: cross Abstract: Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attentio…
arXiv cs.AI TIER_1 English(EN) · Wei Luo, Yi Huang, Songchen Ma, Huanyu Qu, Jiang Cai, Mingkun Xu · 2026-05-26 04:00

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

arXiv:2605.22337v2 Announce Type: replace Abstract: The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important rese…
arXiv cs.AI TIER_1 English(EN) · Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo · 2026-05-26 04:00

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

arXiv:2605.25475v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context infe…
arXiv cs.AI TIER_1 English(EN) · Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso · 2026-05-25 04:00

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

arXiv:2605.22850v1 Announce Type: cross Abstract: Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache …
arXiv cs.LG TIER_1 English(EN) · Yuping Lin, Jiayuan Ding, Yue Xing, Pengfei He, Jiliang Tang, Subhabrata Mukherjee · 2026-05-25 04:00

A Simple Plug-in for Improving Eviction-Based KV Cache Compression

arXiv:2605.23258v1 Announce Type: new Abstract: KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical …
arXiv cs.LG TIER_1 English(EN) · Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu · 2026-05-22 04:00

CacheClip: Accelerating RAG with Effective KV Cache Reuse

arXiv:2510.10129v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical …
arXiv cs.AI TIER_1 English(EN) · Mark Boss, Vikram Voleti, Simon Donné, Shimon Vainer · 2026-05-22 04:00

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

arXiv:2605.21226v1 Announce Type: cross Abstract: The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-co…
arXiv cs.CL TIER_1 English(EN) · Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · 2026-05-22 04:00

InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models

arXiv:2602.23200v2 Announce Type: replace-cross Abstract: When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decodin…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 11:24

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important research direction; however, existing methods based on fi…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-20 14:19

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…
arXiv cs.AI TIER_1 English(EN) · Shimon Vainer · 2026-05-20 14:19

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…
arXiv cs.CL TIER_1 English(EN) · Ngai Wong · 2026-05-19 10:53

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic chan…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-13 00:00

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

KVServe is a service-aware and adaptive framework for optimizing key-value communication compression in disaggregated large language model serving, achieving significant improvements in job completion time and time-to-first-token reduction through dynamic optimization.
Together AI blog TIER_1 English(EN) · 2026-03-04 00:00

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM serving.
MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-05-25 21:24

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Together AI has released OSCAR (Offline Spectral Covariance-Aware Rotation), an INT2 KV cache quantization method for long-context LLM serving. Unlike prior rotation-based approaches that apply data-oblivious Hadamard transforms, OSCAR derives separate rotations for keys and v…
Towards AI TIER_1 English(EN) · Sumit Vedpathak · 2026-05-25 22:01

The Silent Speedup: How KV Cache Makes AI Feel Instant

https://pub.towardsai.net/the-silent-speedup-how-kv-cache-makes-ai-feel-instant-273031a9e6bc?source=rss----98111c9905da---4
Towards AI TIER_1 English(EN) · Armin Norouzi, Ph.D · 2026-05-19 22:01

KV Cache Internals: How Transformers Avoid Recomputing Attention

https://pub.towardsai.net/kv-cache-internals-how-transformers-avoid-recomputing-attention-27672f3382e0?source=rss----98111c9905da---4
r/LocalLLaMA TIER_1 English(EN) · /u/Thrumpwart · 2026-05-26 04:04

Shard - getting to 10× KV cache compression

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a…
Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-05-25 22:53

Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantisation system for long-context LLM serving. The method derives separate rotations for keys and values from attention-aware covariance structures, reducing the BF16 accuracy gap to just 3.78 points while d…

Link: marktechpost.com/…/together-ai-open-sourc…
r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-05-25 11:52

OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

https://www.reddit.com/r/LocalLLaMA/comments/1tn6v0r/oscar_rotationzoo_offline_spectral/
r/LocalLLaMA TIER_1 English(EN) · /u/ayylmaonade · 2026-05-25 02:51

llama.cpp has a clever trick for speeding up KV cache decode

So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one under deve…
