New Methods Tackle KV Cache Compression for Long-Context LLMs

Apple Machine Learning Research TIER_1 English(EN) · 2026-05-19 00:00

EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model’s memory foo…

arXiv cs.AI TIER_1 English(EN) · Anirudh Sekar · 2026-06-10 04:00

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

arXiv:2606.09937v1 Announce Type: cross Abstract: We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the pre…

arXiv cs.AI TIER_1 English(EN) · Bruce Changlong Xu, Adarsh Kumarappan, Mu Zhou · 2026-06-10 04:00

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arXiv:2606.09864v1 Announce Type: cross Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study,…

arXiv cs.AI TIER_1 English(EN) · Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei, Xiangyuan Wang, Mengzhe Ruan, Hanxu Hou, Peisong Wang, Linqi Song, Shuang Qiu · 2026-06-10 04:00

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

arXiv:2606.11164v1 Announce Type: new Abstract: Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token ev…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 17:44

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget di…

arXiv cs.AI TIER_1 English(EN) · Shuang Qiu · 2026-06-09 17:44

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget di…

arXiv cs.AI TIER_1 English(EN) · Priyansh Bhatnagar, Ashkan Moradifirouzabadi, Se-Hyun Yang, SeungJae Lee, Jungwook Choi, Mingu Kang · 2026-06-09 04:00

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

arXiv:2606.08382v1 Announce Type: cross Abstract: Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive co…

arXiv cs.LG TIER_1 English(EN) · Yuji Yamamoto, Satoshi Matsuura · 2026-06-09 04:00

Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems

arXiv:2604.17249v2 Announce Type: replace-cross Abstract: Rowhammer on GPU DRAM has enabled adversarial bit flips in model weights; shared KV-cache blocks in LLM serving systems present an analogous but previously unexamined target. In vLLM's Prefix Caching, these blocks exist as…

arXiv cs.LG TIER_1 English(EN) · Yang Pengju · 2026-06-09 04:00

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

arXiv:2606.08635v1 Announce Type: new Abstract: Prefill-decode (PD) disaggregation decouples prompt processing from token generation, but it also turns the key-value (KV) cache into a network payload. Existing PD-side KV reduction methods are mostly binary: selected tokens are tr…

arXiv cs.LG TIER_1 English(EN) · Charles O'Neill, Alex Sandomirsky, Harry Partridge, Mudith Jayasekara, Max Kirkby · 2026-06-09 04:00

Still: Amortized KV Cache Compression in a Single Forward Pass

arXiv:2606.07878v1 Announce Type: new Abstract: The KV cache is the memory bottleneck of long-horizon language model deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under constraint, and re…

arXiv cs.CL TIER_1 English(EN) · Songhao Wu, Ang Lv, Xiao Feng, Yufei Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan · 2026-06-08 04:00

PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration

arXiv:2502.00527v2 Announce Type: replace-cross Abstract: The KV cache in large language models is a dominant factor in memory usage, limiting their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational costs; however, previou…

arXiv cs.AI TIER_1 English(EN) · Mingu Kang · 2026-06-07 00:24

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We pr…

arXiv cs.CL TIER_1 English(EN) · Chunsheng Zuo, Liaoyaqi Wang, William Jurayj, William Fleshman, Benjamin Van Durme · 2026-06-05 04:00

Rethinking LoRA Memory Through the Lens of KV Cache Compression

arXiv:2606.05698v1 Announce Type: new Abstract: Parametric retrieval augmentation encodes document information into lightweight, document-specific modules such as LoRA adapters, reducing the need to include all evidence as input context. However, it remains unclear how this param…

arXiv cs.LG TIER_1 English(EN) · Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi · 2026-06-05 04:00

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-Turn LLM Serving

arXiv:2606.06302v1 Announce Type: new Abstract: Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effect…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 15:41

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-Turn LLM Serving

Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering …

arXiv cs.LG TIER_1 English(EN) · Jungwook Choi · 2026-06-04 15:41

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-Turn LLM Serving

Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering …

arXiv cs.CL TIER_1 English(EN) · Momchil Hardalov, Gonzalo Iglesias, Adrià de Gispert · 2026-06-04 04:00

Cartridges at Scale: Training Modular KV Caches over Large Document Collections

arXiv:2606.04557v1 Announce Type: new Abstract: Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-va…

arXiv cs.CL TIER_1 English(EN) · Adrià de Gispert · 2026-06-03 07:42

Cartridges at Scale: Training Modular KV Caches over Large Document Collections

Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while …

arXiv cs.CL TIER_1 English(EN) · Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia · 2026-06-03 04:00

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

arXiv:2606.03928v1 Announce Type: cross Abstract: Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache…

arXiv cs.LG TIER_1 English(EN) · Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lukas Cavigelli · 2026-06-03 04:00

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

arXiv:2606.03458v1 Announce Type: new Abstract: Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but c…

arXiv cs.CL TIER_1 English(EN) · Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao, Bin Cui · 2026-06-03 04:00

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

arXiv:2606.02964v1 Announce Type: cross Abstract: Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches …

arXiv cs.CL TIER_1 English(EN) · Robin Jia · 2026-06-02 17:16

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selecti…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 17:16

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

Value-aware stochastic KV cache eviction method improves reasoning model accuracy under compression by protecting large-magnitude states and promoting diverse eviction decisions.

arXiv cs.LG TIER_1 English(EN) · Lukas Cavigelli · 2026-06-02 10:34

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like …

arXiv cs.AI TIER_1 English(EN) · Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang, Siteng Huang, Linfeng Zhang · 2026-06-02 04:00

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

arXiv:2606.01790v1 Announce Type: cross Abstract: Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS…

arXiv cs.AI TIER_1 English(EN) · Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu, Xiaojie Li, Jungang Lou, Zechao Li, Jing Liu · 2026-06-02 04:00

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

arXiv:2606.00724v1 Announce Type: cross Abstract: Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in lo…

arXiv cs.LG TIER_1 English(EN) · Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim · 2026-06-02 04:00

LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

arXiv:2602.01053v2 Announce Type: replace Abstract: Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only by lightweight adapters. Despite sharing base model weights, each agent independently buil…

arXiv cs.LG TIER_1 English(EN) · Yu Li, Binxu Li, Tian Lan · 2026-06-02 04:00

MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference

arXiv:2606.01563v1 Announce Type: new Abstract: Autoregressive decoding in Transformer-based language models relies on the KV cache, whose memory footprint grows linearly with sequence length and becomes the primary bottleneck for long-context inference. KV cache eviction address…

arXiv cs.CL TIER_1 English(EN) · Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao · 2026-06-02 04:00

ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

arXiv:2602.03203v2 Announce Type: replace Abstract: Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory …

arXiv cs.AI TIER_1 English(EN) · Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen · 2026-06-02 04:00

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

arXiv:2602.08585v2 Announce Type: replace-cross Abstract: Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score mag…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

KVarN is a calibration-free KV-cache quantizer that uses Hadamard rotation and dual-scaling variance normalization to reduce error accumulation during autoregressive decoding in large language models.

arXiv cs.LG TIER_1 English(EN) · Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello · 2026-06-01 04:00

Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

arXiv:2601.21686v2 Announce Type: replace Abstract: Key-value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by p…

arXiv cs.CL TIER_1 English(EN) · Vinayshekhar Bannihatti Kumar, Manoj Ghuhan Arivazhagan, Disha Makhija, Rashmi Gangadharaiah · 2026-06-01 04:00

Probing the Prompt KV Cache: Where It Becomes Dispensable

arXiv:2605.30574v1 Announce Type: new Abstract: Prior KV cache compression schemes empirically demonstrate that the prompt cache is partially redundant during decoding, dropping or summarising entries with little accuracy loss. We ask when and what kind of redundancy: at which la…

arXiv cs.CL TIER_1 English(EN) · Junjie Peng, You Wu, Haoyi Wu, Jialong Han, Xiaohua Xie, Kewei Tu, Jianhuang Lai · 2026-06-01 04:00

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

arXiv:2605.31105v1 Announce Type: new Abstract: Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compressio…

arXiv cs.CL TIER_1 English(EN) · Yanlin Qi, Xinhang Chen, Huiqiang Jiang, Qitong Wang, Botao Peng, Themis Palpanas · 2026-06-01 04:00

ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs

arXiv:2602.07721v3 Announce Type: replace-cross Abstract: KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework…

arXiv cs.CL TIER_1 English(EN) · Jianhuang Lai · 2026-05-29 10:16

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through ev…

arXiv cs.CL TIER_1 English(EN) · Yuan Feng, Junlin Lv, Haoyu Guo, Yukun Cao, S Kevin Zhou, Xike Xie · 2026-05-29 04:00

CriticalKV: Optimizing KV Cache Eviction from an Output Perturbation Perspective

arXiv:2502.03805v2 Announce Type: replace Abstract: Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large KV…

arXiv cs.AI TIER_1 English(EN) · Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh · 2026-05-29 04:00

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

arXiv:2605.29873v1 Announce Type: new Abstract: Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill c…

arXiv cs.AI TIER_1 English(EN) · Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag · 2026-05-29 04:00

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

arXiv:2605.30351v1 Announce Type: cross Abstract: Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-h…

arXiv cs.AI TIER_1 English(EN) · Pinar Yanardag · 2026-05-28 17:59

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to st…

arXiv cs.CL TIER_1 English(EN) · Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Hung-Yueh Chiang, Yash Akhauri, Xilai Dai, Huiqiang Jiang, Yucheng Li, Luis Ceze, Kai-Chiang Wu, Mohamed S. Abdelfattah · 2026-05-28 04:00

xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction

arXiv:2503.18893v2 Announce Type: replace Abstract: Long-context Large Language Models (LLMs) enable powerful applications but incur high memory costs due to the key-value states (KV-Cache). Recent studies attempt to share KV-Cache across layers, but these approaches either requi…

arXiv cs.CL TIER_1 English(EN) · Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang · 2026-05-28 04:00

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

arXiv:2510.08525v3 Announce Type: replace Abstract: Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Exi…

arXiv cs.AI TIER_1 English(EN) · Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, David Cox, Antonio Torralba · 2026-05-28 04:00

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

arXiv:2605.27646v1 Announce Type: cross Abstract: We propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quanti…

arXiv cs.CL TIER_1 English(EN) · Hong Chen, Xiang Liu, Yubo Gao, Yuxuan Fan, Bo Wang, Yuanlin Chu, Yuanguo Lin, Xuming Hu · 2026-05-27 04:00

NestedKV: Nested Memory Routing for Long-Context KV Cache Compression

arXiv:2605.26678v1 Announce Type: new Abstract: Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal -- attention, recency, layer-wise allocation, o…

arXiv cs.LG TIER_1 English(EN) · Zejia Qi · 2026-05-27 04:00

LearnedCache: An eBPF-Integrated Perceptron-Based Eviction Policy for the Linux Page Cache

arXiv:2605.26168v1 Announce Type: cross Abstract: Linux is the foundation of the digital age, accounting for the majority of the cloud and mobile OS markets. Any device that runs Linux uses the Linux page cache, a central pillar in OS and application performance, serving to reduc…

arXiv cs.AI TIER_1 English(EN) · Tuna Tuncer, Felix Becker, Thomas Pfeil · 2026-05-27 04:00

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

arXiv:2605.26266v1 Announce Type: cross Abstract: Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the…

arXiv cs.AI TIER_1 English(EN) · Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo · 2026-05-26 04:00

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

arXiv:2605.25475v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context infe…

arXiv cs.AI TIER_1 English(EN) · Yubo Li, Yidi Miao · 2026-05-26 04:00

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

arXiv:2605.24786v1 Announce Type: cross Abstract: Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attentio…

arXiv cs.AI TIER_1 English(EN) · Wei Luo, Yi Huang, Songchen Ma, Huanyu Qu, Jiang Cai, Mingkun Xu · 2026-05-26 04:00

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

arXiv:2605.22337v2 Announce Type: replace Abstract: The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important rese…

arXiv cs.CL TIER_1 English(EN) · Yike Guo · 2026-05-25 06:29

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context inference. A practical remedy is to evict less importa…

arXiv cs.LG TIER_1 English(EN) · Yuping Lin, Jiayuan Ding, Yue Xing, Pengfei He, Jiliang Tang, Subhabrata Mukherjee · 2026-05-25 04:00

A Simple Plug-in for Improving Eviction-Based KV Cache Compression

arXiv:2605.23258v1 Announce Type: new Abstract: KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical …

arXiv cs.AI TIER_1 English(EN) · Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso · 2026-05-25 04:00

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

arXiv:2605.22850v1 Announce Type: cross Abstract: Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-24 00:00

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

CONF-KV is a KV-cache management system that dynamically adjusts cache retention based on model uncertainty, improving memory efficiency and performance for long-sequence language model inference.

arXiv cs.CL TIER_1 English(EN) · Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · 2026-05-22 04:00

InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models

arXiv:2602.23200v2 Announce Type: replace-cross Abstract: When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decodin…

arXiv cs.AI TIER_1 English(EN) · Mark Boss, Vikram Voleti, Simon Donné, Shimon Vainer · 2026-05-22 04:00

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

arXiv:2605.21226v1 Announce Type: cross Abstract: The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-co…

arXiv cs.LG TIER_1 English(EN) · Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu · 2026-05-22 04:00

CacheClip: Accelerating RAG with Effective KV Cache Reuse

arXiv:2510.10129v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 11:24

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important research direction; however, existing methods based on fi…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-20 14:19

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…

arXiv cs.AI TIER_1 English(EN) · Shimon Vainer · 2026-05-20 14:19

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…

arXiv cs.CL TIER_1 English(EN) · Ngai Wong · 2026-05-19 10:53

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic chan…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-13 00:00

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

KVServe is a service-aware and adaptive framework for optimizing key-value communication compression in disaggregated large language model serving, achieving significant improvements in job completion time and time-to-first-token reduction through dynamic optimization.

arXiv cs.CV TIER_1 English(EN) · Weijia Dou, Hui Li, Jiahao Cui, Lei Zhou, Jingdong Wang, Siyu Zhu · 2026-06-01 04:00

SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation

arXiv:2605.31033v1 Announce Type: new Abstract: Streaming video generation models typically rely on temporal-centric memory, which organizes historical context as raw frames, chunk segments, or unclustered tokens. This organization frequently leads to identity drift and semantic …

arXiv cs.CV TIER_1 English(EN) · Jiayi Luo, Qiyan Liu, Tengyang Wang, JunHao Liu, Jiayu Chen, Cong Wang, Hanxin Zhu, Chen Gao, Xiaobin Hu, Qingyun Sun, Zhibo Chen · 2026-05-29 04:00

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

arXiv:2605.30083v1 Announce Type: new Abstract: Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to av…

arXiv cs.CV TIER_1 English(EN) · Zhibo Chen · 2026-05-28 15:30

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to avoid redundant recomputation across generation st…

Together AI blog TIER_1 English(EN) · 2026-03-04 00:00

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster long-context LLM serving

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM serving.

MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-05-25 21:24

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

Together AI has released OSCAR (Offline Spectral Covariance-Aware Rotation), an INT2 KV cache quantization method for long-context LLM serving. Unlike prior rotation-based approaches that apply data-oblivious Hadamard transforms, OSCAR derives separate rotations for keys and v…

Towards AI TIER_1 English(EN) · Sumit Vedpathak · 2026-05-25 22:01

The Silent Speedup: How KV Cache Makes AI Feel Instant

https://pub.towardsai.net/the-silent-speedup-how-kv-cache-makes-ai-feel-instant-273031a9e6bc?source=rss----98111c9905da---4

Towards AI TIER_1 English(EN) · Armin Norouzi, Ph.D · 2026-05-19 22:01

KV Cache Internals: How Transformers Avoid Recomputing Attention

https://pub.towardsai.net/kv-cache-internals-how-transformers-avoid-recomputing-attention-27672f3382e0?source=rss----98111c9905da---4

r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-06-09 19:00

OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

https://www.reddit.com/r/LocalLLaMA/comments/1u1edjb/oscar_rotationzoo_offline_spectral/

r/LocalLLaMA TIER_1 English(EN) · /u/Rikers88 · 2026-06-08 11:59

[Benchmark] DFlash speculative decoding + KV cache compression achieves a 3.26x speedup on RTX 5090

Hardware: RTX 5090 | Model: Qwen3.6-27B | Framework: BeeLlama.cpp. Full benchmark scripts, raw data, config, and generated artifacts are available on request; just DM or comment below.…

r/LocalLLaMA TIER_1 (CA) · /u/Anbeeld · 2026-06-07 11:54

Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ

Link https://www.reddit.com/r/LocalLLaMA/comments/1tza4ji/qwen_36_27b_kv_cache_quant_benchmarks_75_pairs/

r/LocalLLaMA TIER_1 English(EN) · /u/AccountAntique9327 · 2026-06-07 03:38

GraphKV: KV cache optimization based on graph embedding models

I've been working on a project inspired by TurboQuant. It isn't perfect, but it's pretty good for a project I started today. Please check it out: GraphKV (https://github.com/heterodoxin/graphkv) …

r/LocalLLaMA TIER_1 English(EN) · /u/Anbeeld · 2026-06-06 18:06

KV cache quantization benchmarks: KVarN at 6 bits matches q8_0 precision, and at 4 bits matches q5_0. Impressive!

TL;DR: Based on long-context KLD benchmarks, KVarN appears to be just better than the usual llama.cpp KV cache quants. At every size, KVarN matches the precision of the usual quants one bit higher.…

dev.to — LLM tag TIER_1 English(EN) · Tech_Nuggets · 2026-06-06 01:10

KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

You just deployed a 70B Llama fine-tune on 8x H100s, and your serving box happily handles 200 concurrent 8k contexts. Then product says "can you do 32k?" and suddenly the math stops …
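The sizing arithmetic behind that scenario can be sketched as follows. This is a rough estimate only: the layer count, KV head count, and head dimension are assumed values for a Llama-70B-class model with grouped-query attention (the post does not state them), "8k" and "32k" are taken as 8192 and 32768 tokens, and `kv_cache_bytes` and `report` are hypothetical helper names.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, num_seqs, bytes_per_elem):
    """Total KV cache size: one K and one V vector per layer, KV head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * num_seqs


# Assumed model shape (not given in the post): 80 layers, 8 KV heads, head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
GIB = 2 ** 30


def report(seq_len, bytes_per_elem, num_seqs=200):
    """KV cache size in GiB for num_seqs concurrent sequences of seq_len tokens."""
    return kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len, num_seqs, bytes_per_elem) / GIB


print(report(8 * 1024, 2))   # 16-bit cache, 8k contexts  -> 500.0
print(report(32 * 1024, 2))  # 16-bit cache, 32k contexts -> 2000.0
print(report(32 * 1024, 1))  # 8-bit cache, 32k contexts  -> 1000.0
```

Under these assumptions the cache grows linearly with context length, so moving from 8k to 32k quadruples it, while an 8-bit K/V format recovers a factor of two. Real servers add scale metadata and allocator overhead that this estimate ignores.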

r/LocalLLaMA TIER_1 English(EN) · /u/wadeAlexC · 2026-06-04 18:52

Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist

We all know the struggle of optimizing your VRAM usage: quantized model, quantized kvcache, mmproj off. I'm often frustrated by the tradeoffs I have to make in these areas. On my RTX 5090, I can fit: Qwen3.5-27B @ Q6_K, …

r/LocalLLaMA TIER_1 English(EN) · /u/acluk90 · 2026-06-04 14:47

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)

Link https://www.reddit.com/r/LocalLLaMA/comments/1twptw2/kvarn_new_kvcache_quant_from_huawei_35_kv_cache/

r/MachineLearning TIER_1 English(EN) · /u/intentionallyBlue · 2026-06-04 13:21

KVarN: Variance-Normalized KV-Cache Quantization [R]

Excited to share some of my own work here :) KVarN is our new KV-Cache quantization method. In very brief, we combine Hadamard rotations with variance-normalization on both axes of the K and V matrices, then roun…
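To make the Hadamard-rotation step concrete, here is a minimal sketch of rotating a vector with an orthonormal fast Walsh-Hadamard transform and then applying plain symmetric round-to-nearest quantization. It is not KVarN itself: the variance normalization and everything after the truncation point in the post are deliberately left out, and all function names and the 4-bit setting are illustrative assumptions.

```python
import math


def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two.

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization with a single scale for the row."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in row)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]


def roundtrip(row, bits=4):
    """Rotate, quantize, dequantize, then rotate back to the original basis."""
    codes, scale = quantize_row(hadamard_rotate(row), bits)
    return hadamard_rotate(dequantize_row(codes, scale))


# A row with one large outlier channel, the case rotations are meant to smooth out.
row = [0.1, -0.2, 0.15, 8.0, -0.1, 0.05, 0.2, -0.15]
print(roundtrip(row))
```

Because the rotation is orthonormal, the squared reconstruction error in the original basis equals the quantization error in the rotated basis, which is why this step can be undone exactly at dequantization time.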

r/LocalLLaMA TIER_1 English(EN) · /u/Thrumpwart · 2026-05-26 04:04

Shard - getting to 10× KV cache compression

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a…

Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-05-25 22:53

Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantisation system for long-context LLM serving. The method derives separate rotations for keys and values from attention-aware covariance structures, reducing the BF16 accuracy gap to just 3.78 points while d…

Link marktechpost.com/…/together-ai-open-sourc…

r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-05-25 11:52

OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Link https://www.reddit.com/r/LocalLLaMA/comments/1tn6v0r/oscar_rotationzoo_offline_spectral/

r/LocalLLaMA TIER_1 English(EN) · /u/ayylmaonade · 2026-05-25 02:51

llama.cpp has a clever trick for speeding up KV cache decode

So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one under deve…

Sources [81]