New research tackles LLM KV cache bottlenecks with advanced compression and storage
By PulseAugur Editorial · [21 sources]
Multiple research papers published in May 2026 introduce techniques for optimizing the Key-Value (KV) cache in large language models, targeting its memory and latency bottlenecks. The methods include offloading the KV cache to object storage such as S3 (ObjectCache), compressing it with three-way token routing (VECTOR), and using auxiliary models to recompute the KV cache selectively (CacheClip). Other approaches apply hardware-aware quantization (InnerQ, OCTOPUS) and service-aware adaptive compression (KVServe) to reduce decode latency, especially for long-context inference and retrieval-augmented generation (RAG) systems.
AI
IMPACT
These advancements in KV cache optimization promise to significantly improve the efficiency and speed of long-context LLM inference, making advanced AI applications more practical and cost-effective.
RANK_REASON
Multiple research papers published on arXiv detailing new methods for optimizing KV cache in LLMs.
arXiv:2605.24786v1 Announce Type: cross Abstract: Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attentio…
arXiv:2605.22337v2 Announce Type: replace Abstract: The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important rese…
arXiv:2605.25475v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context infe…
arXiv:2605.22850v1 Announce Type: cross Abstract: Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache …
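As a rough illustration of the prefix KV caching described in this abstract, the sketch below keys cached blocks by a chained hash of the tokens that precede them, so a new request can reuse every block of a shared system prompt. The `BLOCK` size, the `PrefixCache` class, and the `"kv-block"` placeholder are hypothetical names for illustration only; this is not the paper's system or any serving framework's API.

```python
import hashlib

BLOCK = 4  # tokens per cache block (hypothetical value for illustration)

def block_keys(tokens):
    """Return one chained hash per full block; each key identifies the whole prefix up to that block."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        keys.append(h.copy().hexdigest())
    return keys

class PrefixCache:
    def __init__(self):
        self.store = {}  # prefix hash -> placeholder standing in for that block's KV tensors

    def insert(self, tokens):
        for k in block_keys(tokens):
            self.store.setdefault(k, "kv-block")

    def cached_prefix_len(self, tokens):
        """Number of leading tokens whose KV could be reused instead of recomputed."""
        n = 0
        for k in block_keys(tokens):
            if k not in self.store:
                break
            n += BLOCK
        return n

system_prompt = list(range(8))  # 8 token ids shared by both requests
cache = PrefixCache()
cache.insert(system_prompt + [100, 101, 102, 103])
reused = cache.cached_prefix_len(system_prompt + [200, 201, 202, 203])
print(reused)  # the 8 system-prompt tokens hit the cache; the last block does not
```

Because the store only ever grows in this sketch, it also shows why the accumulated cache needs some eviction or offloading policy in a real deployment.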
arXiv:2605.23258v1 Announce Type: new Abstract: KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical …
arXiv cs.LG
TIER_1 · English (EN) · Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu
arXiv:2510.10129v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical …
arXiv cs.AI
TIER_1 · English (EN) · Mark Boss, Vikram Voleti, Simon Donné, Shimon Vainer
arXiv:2605.21226v1 Announce Type: cross Abstract: The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-co…
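The general idea of "a structured random rotation followed by a per-coordinate scalar quantizer" can be sketched in a few lines. The example below uses a randomized Hadamard transform as the structured rotation and a plain symmetric 4-bit uniform quantizer; both are assumptions chosen for brevity, and this is not the TurboQuant, PolarQuant, or paper codec. All function names are hypothetical.

```python
import math
import random

def fwht(v):
    """Fast Walsh-Hadamard transform, normalised to be orthonormal (length must be a power of two)."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    s = 1.0 / math.sqrt(len(v))
    return [x * s for x in v]

def rotate(x, signs):
    # Random sign flips followed by the Hadamard transform.
    return fwht([a * s for a, s in zip(x, signs)])

def unrotate(y, signs):
    # The normalised Hadamard matrix is its own inverse; afterwards undo the sign flips.
    return [a * s for a, s in zip(fwht(y), signs)]

def quantize(v, bits=4):
    """Symmetric uniform scalar quantizer with one scale per vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(a) for a in v) / qmax or 1.0
    codes = [max(-qmax, min(qmax, round(a / scale))) for a in v]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

n = 64
rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

x = [0.0] * n
x[0] = 8.0                    # a single outlier channel
y = rotate(x, signs)          # energy is spread: every |y_i| equals 8 / sqrt(64) = 1
codes, scale = quantize(y)
x_hat = unrotate(dequantize(codes, scale), signs)
```

The rotation turns one large outlier into many equal-magnitude coordinates, which is the property that lets a low-bit per-coordinate quantizer use its full range. Whether rotation lowers the reconstruction error on real key and value vectors depends on their distribution.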
arXiv cs.CL
TIER_1 · English (EN) · Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross
arXiv:2602.23200v2 Announce Type: replace-cross Abstract: When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decodin…
The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important research direction; however, existing methods based on fi…
The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…
The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic chan…
KVServe is a service-aware and adaptive framework for optimizing key-value communication compression in disaggregated large language model serving, achieving significant improvements in job completion time and time-to-first-token reduction through dynamic optimization.
Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM serving.
Together AI has released OSCAR (Offline Spectral Covariance-Aware Rotation), an INT2 KV cache quantization method for long-context LLM serving. Unlike prior rotation-based approaches that apply data-oblivious Hadamard transforms, OSCAR derives separate rotations for keys and v…
TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a…
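To put the quoted ratios in context, here is a back-of-the-envelope calculation of the uncompressed KV cache size for a single sequence. The model shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128) is assumed from Llama-3.1-8B's published configuration, and BF16 storage (2 bytes per element) is an assumed baseline; the post does not say which baseline its ratios are measured against. The function name is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for the keys plus values of one sequence (the leading 2 counts K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads, head_dim 128.
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (8 * 1024, 32 * 1024):
    gib = kv_cache_bytes(ctx, **LLAMA_31_8B) / 2**30
    print(f"{ctx:>6} tokens: {gib:.1f} GiB in BF16")
```

Under these assumptions the cache is 1 GiB at 8K and 4 GiB at 32K, growing linearly with context length, so a 10× reduction at 8K would bring it to roughly 0.1 GiB and an 11× reduction at 32K to roughly 0.36 GiB.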
Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantisation system for long-context LLM serving. The method derives separate rotations for keys and values from attention-aware covariance structures, reducing the BF16 accuracy gap to just 3.78 points while d…
So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one under deve…