New research tackles LLM KV cache bottlenecks with advanced compression and storage

By PulseAugur Editorial · [21 sources] · 2026-03-04 00:00

Multiple research papers published or revised in May 2026 introduce techniques to optimize the key-value (KV) cache in large language models, targeting memory and latency bottlenecks. The methods include offloading the KV cache to object storage such as S3 (ObjectCache), compressing it with three-way token routing (VECTOR), and using an auxiliary model to select which KV entries to recompute (CacheClip). Other approaches focus on KV cache quantization (the hardware-aware InnerQ, and OCTOPUS) and service-aware adaptive compression (KVServe) to reduce decode latency, especially for long-context inference and retrieval-augmented generation (RAG) systems.
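To make the memory pressure concrete, here is a minimal sketch of the usual size estimate for an uncompressed KV cache. The function name and the example configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures taken from any of the papers covered here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate the size of an uncompressed KV cache in bytes.

    The leading factor of 2 counts one key vector and one value vector
    per token, per layer, per KV head.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical grouped-query model: 32 layers, 8 KV heads, head_dim 128, fp16.
GIB = 2 ** 30
size_8k = kv_cache_bytes(32, 8, 128, seq_len=8192)
size_32k = kv_cache_bytes(32, 8, 128, seq_len=32768)
print(size_8k / GIB)   # 1.0
print(size_32k / GIB)  # 4.0
```

The estimate is linear in sequence length and batch size, which is why long contexts and many concurrent requests exhaust GPU memory well before model weights do.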

IMPACT These KV cache optimizations target lower memory use, shorter time-to-first-token, and lower decode latency for long-context LLM inference, which would make long-context and RAG applications cheaper to serve.

RANK_REASON Multiple research papers published on arXiv detailing new methods for optimizing KV cache in LLMs.

paper
infra

AI-generated summary · Google Gemini · from 21 sources.

COVERAGE [21]

arXiv cs.AI TIER_1 English(EN) · Yubo Li, Yidi Miao · 2026-05-26 04:00

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

arXiv:2605.24786v1 Announce Type: cross Abstract: Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attentio…
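The baseline this abstract describes, a static recency window combined with historical attention scores, can be sketched roughly as follows. This is not the CONF-KV method itself; the function name, the `budget` and `recent_window` parameters, and the idea of a single accumulated score per cached token are assumptions made for illustration.

```python
def select_kept_positions(attn_history, budget, recent_window):
    """Pick which cached token positions to keep under a fixed budget.

    attn_history[i] is the accumulated attention mass that cached token i
    has received so far (a hypothetical per-token statistic).
    The most recent `recent_window` tokens are always kept; the remaining
    slots go to the older tokens with the highest accumulated attention.
    """
    if recent_window > budget:
        raise ValueError("recent_window must not exceed budget")
    n = len(attn_history)
    if n <= budget:
        return list(range(n))  # nothing to evict yet

    recent = list(range(n - recent_window, n))
    older = range(n - recent_window)
    slots = budget - recent_window
    heavy = sorted(older, key=lambda i: attn_history[i], reverse=True)[:slots]
    return sorted(heavy) + recent


scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3, 0.1, 0.4]
print(select_kept_positions(scores, budget=4, recent_window=2))  # [0, 3, 6, 7]
```

Because both the window and the scores look backward, a policy like this can drop a token that only becomes important later, which is the weakness that newer eviction schemes try to address.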
arXiv cs.AI TIER_1 English(EN) · Wei Luo, Yi Huang, Songchen Ma, Huanyu Qu, Jiang Cai, Mingkun Xu · 2026-05-26 04:00

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

arXiv:2605.22337v2 Announce Type: replace Abstract: The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important rese…
arXiv cs.AI TIER_1 English(EN) · Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo · 2026-05-26 04:00

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

arXiv:2605.25475v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context infe…
arXiv cs.AI TIER_1 English(EN) · Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso · 2026-05-25 04:00

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

arXiv:2605.22850v1 Announce Type: cross Abstract: Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache …
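As a rough illustration of prefix KV caching in general (not of ObjectCache's layerwise retrieval design), the sketch below chain-hashes fixed-size token blocks so that each key identifies the whole prefix up to that block, then looks up the longest cached prefix for a new request. The `PrefixStore` class, the block size, and the in-memory dict standing in for an object store are all hypothetical.

```python
import hashlib

BLOCK = 4  # tokens per cached block (illustrative value)


def block_keys(token_ids, block=BLOCK):
    """Return one key per full block; each key covers the prefix up to that block."""
    keys = []
    running = hashlib.sha256()
    for end in range(block, len(token_ids) + 1, block):
        chunk = token_ids[end - block:end]
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.hexdigest())
    return keys


class PrefixStore:
    """Toy key -> KV payload store; a dict stands in for remote object storage."""

    def __init__(self):
        self._objects = {}

    def put(self, token_ids, payloads):
        for key, payload in zip(block_keys(token_ids), payloads):
            self._objects[key] = payload

    def longest_prefix(self, token_ids):
        """Return (tokens covered by cached blocks, their payloads in order)."""
        hits = []
        for key in block_keys(token_ids):
            if key not in self._objects:
                break
            hits.append(self._objects[key])
        return len(hits) * BLOCK, hits


store = PrefixStore()
system_prompt = list(range(8))            # 8 shared prompt tokens = 2 blocks
store.put(system_prompt, ["kv-block-0", "kv-block-1"])

request = system_prompt + [99, 100, 101, 102]
print(store.longest_prefix(request))      # (8, ['kv-block-0', 'kv-block-1'])
```

Only the tokens after the covered prefix need a fresh prefill pass, which is where the time-to-first-token saving comes from.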
arXiv cs.LG TIER_1 English(EN) · Yuping Lin, Jiayuan Ding, Yue Xing, Pengfei He, Jiliang Tang, Subhabrata Mukherjee · 2026-05-25 04:00

A Simple Plug-in for Improving Eviction-Based KV Cache Compression

arXiv:2605.23258v1 Announce Type: new Abstract: KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical …
arXiv cs.LG TIER_1 English(EN) · Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu · 2026-05-22 04:00

CacheClip: Accelerating RAG with Effective KV Cache Reuse

arXiv:2510.10129v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical …
arXiv cs.AI TIER_1 English(EN) · Mark Boss, Vikram Voleti, Simon Donné, Shimon Vainer · 2026-05-22 04:00

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

arXiv:2605.21226v1 Announce Type: cross Abstract: The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-co…
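The general idea of "rotate, then quantize each coordinate" can be sketched in plain Python. This is not OCTOPUS, TurboQuant, or PolarQuant: the rotation here is a random sign flip followed by a Walsh-Hadamard transform, and the quantizer is a plain uniform one rather than one matched to an analytical distribution. All function names and the outlier-channel test vector are assumptions made for illustration.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def rotate(vec, signs):
    # Random sign flips followed by Hadamard: a structured random rotation.
    return fwht([s * x for s, x in zip(signs, vec)])


def unrotate(vec, signs):
    # The orthonormal Hadamard transform and the +/-1 sign flips are self-inverse.
    return [s * x for s, x in zip(signs, fwht(vec))]


def quantize(vec, bits):
    """Uniform symmetric quantizer (bits >= 2): returns (integer codes, scale)."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in vec) or 1.0
    scale = peak / levels
    codes = [max(-levels, min(levels, round(x / scale))) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def demo(dim=64, bits=4, seed=0):
    """Return (error without rotation, error with rotation) for one test vector."""
    rng = random.Random(seed)
    key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    key[3] = 20.0  # one outlier channel dominates the dynamic range
    signs = [1.0 if rng.random() < 0.5 else -1.0 for _ in range(dim)]

    direct = dequantize(*quantize(key, bits))
    rotated = unrotate(dequantize(*quantize(rotate(key, signs), bits)), signs)
    return mse(key, direct), mse(key, rotated)
```

Calling `demo()` compares the two reconstruction errors. The rotation spreads the outlier's energy across all coordinates, so the quantizer's step size is no longer dictated by a single channel and the small coordinates stop rounding to zero.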
arXiv cs.CL TIER_1 English(EN) · Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · 2026-05-22 04:00

InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models

arXiv:2602.23200v2 Announce Type: replace-cross Abstract: When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decodin…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 11:24

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important research direction; however, existing methods based on fi…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-20 14:19

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…
arXiv cs.AI TIER_1 English(EN) · Shimon Vainer · 2026-05-20 14:19

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…
arXiv cs.CL TIER_1 English(EN) · Ngai Wong · 2026-05-19 10:53

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic chan…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-13 00:00

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

KVServe is a service-aware and adaptive framework for optimizing key-value communication compression in disaggregated large language model serving, achieving significant improvements in job completion time and time-to-first-token reduction through dynamic optimization.
Together AI blog TIER_1 English(EN) · 2026-03-04 00:00

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM serving.
MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-05-25 21:24

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Together AI has released OSCAR (Offline Spectral Covariance-Aware Rotation), an INT2 KV cache quantization method for long-context LLM serving. Unlike prior rotation-based approaches that apply data-oblivious Hadamard transforms, OSCAR derives separate rotations for keys and v…
Towards AI TIER_1 English(EN) · Sumit Vedpathak · 2026-05-25 22:01

The Silent Speedup: How KV Cache Makes AI Feel Instant

https://pub.towardsai.net/the-silent-speedup-how-kv-cache-makes-ai-feel-instant-273031a9e6bc?source=rss----98111c9905da---4
Towards AI TIER_1 English(EN) · Armin Norouzi, Ph.D · 2026-05-19 22:01

KV Cache Internals: How Transformers Avoid Recomputing Attention

https://pub.towardsai.net/kv-cache-internals-how-transformers-avoid-recomputing-attention-27672f3382e0?source=rss----98111c9905da---4
r/LocalLLaMA TIER_1 English(EN) · /u/Thrumpwart · 2026-05-26 04:04

Shard - getting to 10× KV cache compression

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a…
Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-05-25 22:53

Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantisation system for long-context LLM serving. The method derives separate rotations fo

Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantisation system for long-context LLM serving. The method derives separate rotations for keys and values from attention-aware covariance structures, reducing the BF16 accuracy gap to just 3.78 points while d…

LINKS marktechpost.com/…/together-ai-open-sourc…
r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-05-25 11:52

OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

https://www.reddit.com/r/LocalLLaMA/comments/1tn6v0r/oscar_rotationzoo_offline_spectral/
r/LocalLLaMA TIER_1 English(EN) · /u/ayylmaonade · 2026-05-25 02:51

llama.cpp has a clever trick for speeding up KV cache decode

So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one under deve…
