New methods tackle LLM KV cache compression for long contexts
ByPulseAugur Editorial·[76 sources]·
Multiple research papers released in May and June 2026 propose novel methods for compressing the Key-Value (KV) cache in large language models (LLMs). These techniques aim to reduce the significant memory overhead associated with long context lengths, enabling more efficient inference on resource-constrained environments. Approaches include episodic management, global regression for merging, drift-robust retrieval, and low-rank approximations, all seeking to maintain model accuracy while drastically cutting memory usage and latency.
AI
IMPACT
These methods aim to significantly reduce memory and latency for LLMs, potentially enabling wider deployment and more complex applications on less powerful hardware.
RANK_REASON
Multiple academic papers proposing new methods for LLM KV cache compression.
Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model’s memory foo…
arXiv:2606.08382v1 Announce Type: cross Abstract: Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive co…
arXiv:2604.17249v2 Announce Type: replace-cross Abstract: Rowhammer on GPU DRAM has enabled adversarial bit flips in model weights; shared KV-cache blocks in LLM serving systems present an analogous but previously unexamined target. In vLLM's Prefix Caching, these blocks exist as…
arXiv cs.LG
TIER_1English(EN)·Charles O'Neill, Alex Sandomirsky, Harry Partridge, Mudith Jayasekara, Max Kirkby·
arXiv:2606.07878v1 Announce Type: new Abstract: The KV cache is the memory bottleneck of long-horizon language model deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under constraint, and re…
arXiv:2606.08635v1 Announce Type: new Abstract: Prefill-decode (PD) disaggregation decouples prompt processing from token generation, but it also turns the key-value (KV) cache into a network payload. Existing PD-side KV reduction methods are mostly binary: selected tokens are tr…
arXiv:2502.00527v2 Announce Type: replace-cross Abstract: The KV cache in large language models is a dominant factor in memory usage, limiting their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational costs; however, previou…
Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We pr…
arXiv cs.LG
TIER_1English(EN)·Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi·
arXiv:2606.06302v1 Announce Type: new Abstract: Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effect…
arXiv cs.CL
TIER_1English(EN)·Chunsheng Zuo, Liaoyaqi Wang, William Jurayj, William Fleshman, Benjamin Van Durme·
arXiv:2606.05698v1 Announce Type: new Abstract: Parametric retrieval augmentation encodes document information into lightweight, document-specific modules such as LoRA adapters, reducing the need to include all evidence as input context. However, it remains unclear how this param…
Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering …
Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering …
arXiv cs.CL
TIER_1English(EN)·Momchil Hardalov, Gonzalo Iglesias, Adri\`a de Gispert·
arXiv:2606.04557v1 Announce Type: new Abstract: Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-va…
Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while …
arXiv:2606.02964v1 Announce Type: cross Abstract: Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches …
arXiv cs.LG
TIER_1English(EN)·Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lukas Cavigelli·
arXiv:2606.03458v1 Announce Type: new Abstract: Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but c…
arXiv:2606.03928v1 Announce Type: cross Abstract: Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache…
Value-aware stochastic KV cache eviction method improves reasoning model accuracy under compression by protecting large-magnitude states and promoting diverse eviction decisions.
Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selecti…
Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like …
arXiv:2606.01563v1 Announce Type: new Abstract: Autoregressive decoding in Transformer-based language models relies on the KV cache, whose memory footprint grows linearly with sequence length and becomes the primary bottleneck for long-context inference. KV cache eviction address…
arXiv cs.LG
TIER_1English(EN)·Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim·
arXiv:2602.01053v2 Announce Type: replace Abstract: Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only by lightweight adapters. Despite sharing base model weights, each agent independently buil…
arXiv cs.CL
TIER_1English(EN)·Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao·
arXiv:2602.03203v2 Announce Type: replace Abstract: Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory …
arXiv:2602.08585v2 Announce Type: replace-cross Abstract: Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score mag…
arXiv:2606.01790v1 Announce Type: cross Abstract: Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS…
arXiv:2606.00724v1 Announce Type: cross Abstract: Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in lo…
KVarN is a calibration-free KV-cache quantizer that uses Hadamard rotation and dual-scaling variance normalization to reduce error accumulation during autoregressive decoding in large language models.
arXiv cs.LG
TIER_1English(EN)·Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Y\"uz\"ug\"uler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello·
arXiv:2601.21686v2 Announce Type: replace Abstract: Key-value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by p…
arXiv:2602.07721v3 Announce Type: replace-cross Abstract: KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework…
arXiv:2605.31105v1 Announce Type: new Abstract: Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compressio…
arXiv:2605.30574v1 Announce Type: new Abstract: Prior KV cache compression schemes empirically demonstrate that the prompt cache is partially redundant during decoding, dropping or summarising entries with little accuracy loss. We ask when and what kind of redundancy: at which la…
Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through ev…
arXiv cs.AI
TIER_1English(EN)·Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag·
arXiv:2605.30351v1 Announce Type: cross Abstract: Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-h…
arXiv:2605.29873v1 Announce Type: new Abstract: Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill c…
arXiv cs.CL
TIER_1English(EN)·Yuan Feng, Junlin Lv, Haoyu Guo, Yukun Cao, S Kevin Zhou, Xike Xie·
arXiv:2502.03805v2 Announce Type: replace Abstract: Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large KV…
Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to st…
arXiv cs.CL
TIER_1English(EN)·Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang·
arXiv:2510.08525v3 Announce Type: replace Abstract: Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Exi…
arXiv cs.AI
TIER_1English(EN)·Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, David Cox, Antonio Torralba·
arXiv:2605.27646v1 Announce Type: cross Abstract: We propose \textbf{Hurwitz Quaternion Multiplicative Quantization (HQMQ)}, a \textbf{calibration-free} method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quanti…
arXiv:2503.18893v2 Announce Type: replace Abstract: Long-context Large Language Models (LLMs) enable powerful applications but incur high memory costs due to the key-value states (KV-Cache). Recent studies attempt to share KV-Cache across layers, but these approaches either requi…
arXiv cs.AI
TIER_1English(EN)·Tuna Tuncer, Felix Becker, Thomas Pfeil·
arXiv:2605.26266v1 Announce Type: cross Abstract: Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the…
arXiv:2605.26168v1 Announce Type: cross Abstract: Linux is the foundation of the digital age, accounting for the majority of the cloud and mobile OS markets. Any device that runs Linux uses the Linux page cache, a central pillar in OS and application performance, serving to reduc…
arXiv:2605.26678v1 Announce Type: new Abstract: Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal -- attention, recency, layer-wise allocation, o…
arXiv:2605.22337v2 Announce Type: replace Abstract: The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important rese…
arXiv:2605.25475v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context infe…
arXiv:2605.24786v1 Announce Type: cross Abstract: Long-horizon LLM inference turns the key--value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attentio…
Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context inference. A practical remedy is to evict less importa…
arXiv:2605.22850v1 Announce Type: cross Abstract: Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache …
arXiv:2605.23258v1 Announce Type: new Abstract: KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical …
CONF-KV is a KV-cache management system that dynamically adjusts cache retention based on model uncertainty, improving memory efficiency and performance for long-sequence language model inference.
arXiv cs.CL
TIER_1English(EN)·Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross·
arXiv:2602.23200v2 Announce Type: replace-cross Abstract: When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decodin…
arXiv cs.LG
TIER_1English(EN)·Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu·
arXiv:2510.10129v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical …
arXiv cs.AI
TIER_1English(EN)·Mark Boss, Vikram Voleti, Simon Donn\'e, Shimon Vainer·
arXiv:2605.21226v1 Announce Type: cross Abstract: The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-co…
The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts.Current KV Cache eviction has become an important research direction; however, existing methods based on fi…
The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…
The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…
The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic chan…
KVServe is a service-aware and adaptive framework for optimizing key-value communication compression in disaggregated large language model serving, achieving significant improvements in job completion time and time-to-first-token reduction through dynamic optimization.
arXiv:2605.31033v1 Announce Type: new Abstract: Streaming video generation models typically rely on temporal-centric memory, which organizes historical context as raw frames, chunk segments, or unclustered tokens. This organization frequently leads to identity drift and semantic …
arXiv:2605.30083v1 Announce Type: new Abstract: Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to av…
Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to avoid redundant recomputation across generation st…
Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM serving.
<p>Together AI has released OSCAR (Offline Spectral Covariance-Aware Rotation), an INT2 KV cache quantization method for long-context LLM serving. Unlike prior rotation-based approaches that apply data-oblivious Hadamard transforms, OSCAR derives separate rotations for keys and v…
<!-- SC_OFF --><div class="md"><p><strong>Hardware:</strong> RTX 5090 | <strong>Model:</strong> Qwen3.6-27B | <strong>Framework:</strong> BeeLlama.cpp</p> <p>Full benchmark scripts, raw data, config, and generated artifacts are available on request — just DM or comment below.</p>…
<!-- SC_OFF --><div class="md"><p>I've been working on a project inspired by TurboQuant, It isnt perfect but it's pretty good for a project I started today, please check it out. <a href="https://github.com/heterodoxin/graphkv">GraphKV</a></p> <table><thead> <tr> <th align="left">…
<!-- SC_OFF --><div class="md"><p><strong>TL;DR Based on long context KLD benchmarks, KVarN appear to be</strong> <strong><em>just better</em></strong> <strong>than usual llama.cpp KV cache quants. At every size, KVarN matches precision of usual quants of one bit higher.</strong>…
<h1> KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break </h1> <p>You just deployed a 70B Llama fine-tune on 8x H100s, and your serving box happily handles 200 concurrent 8k contexts. Then product says "can you do 32k?" and suddenly the math stops …
<!-- SC_OFF --><div class="md"><p>We all know the struggle of optimizing your VRAM usage: quantized model, quantized kvcache, mmproj off.</p> <p>I'm often frustrated by the tradeoffs I have to make in these areas. On my RTX 5090, I can fit:</p> <ul> <li>Qwen3.5-27B @ Q6_K</li> <l…
<table> <tr><td> <a href="https://www.reddit.com/r/LocalLLaMA/comments/1twptw2/kvarn_new_kvcache_quant_from_huawei_35_kv_cache/"> <img alt="KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds u…
<!-- SC_OFF --><div class="md"><p>Excited to share some of my own work here :) </p> <p><strong>KVarN</strong> is our new KV-Cache quantization method. In very brief, we combine Hadamard rotations with variance-normalization <em>on both axes</em> of the K and V matrices, then roun…
<!-- SC_OFF --><div class="md"><p><strong>TL;DR.</strong> <em>Shard</em> is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about <strong>10×</strong> smaller at 8K context (<strong>11×</strong> at 32K) without measurable hits to NIAH or LongBench. It started as a…
Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantisation system for long-context LLM serving. The method derives separate rotations for keys and values from attention-aware covariance structures, reducing the BF16 accuracy gap to just 3.78 points while d…
<!-- SC_OFF --><div class="md"><p>So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one under deve…