New Methods Tackle KV Cache Compression for Long-Context LLMs

Apple Machine Learning Research TIER_1 English(EN) · 2026-05-19 00:00

EpiCache: Episodic KV Cache Management for Long-Term Conversations in Resource-Constrained Environments

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model’s memory foo…
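The linear growth mentioned in this abstract follows from standard KV-cache accounting: each token stores one key vector and one value vector per layer and per KV head. Below is a minimal sketch of that arithmetic. The model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical choice for illustration and does not come from the paper.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The default configuration is hypothetical, chosen only for illustration.
    """
    # The factor 2 covers the key tensor plus the value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


for tokens in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9} tokens -> {gib:.2f} GiB")
```

Because every factor except `seq_len` is fixed by the model, doubling the conversation length doubles the cache, which is why long dialogue histories push methods toward eviction, quantization, or low-rank compression.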

arXiv cs.AI TIER_1 English(EN) · Anirudh Sekar · 2026-06-10 04:00

RKSC: Reasoning-Aware KV Cache Sharing and Confidence-Based Early Exit for Multi-Step LLM Reasoning

arXiv:2606.09937v1 Announce Type: cross Abstract: We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the pre…

arXiv cs.AI TIER_1 English(EN) · Bruce Changlong Xu, Adarsh Kumarappan, Mu Zhou · 2026-06-10 04:00

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arXiv:2606.09864v1 Announce Type: cross Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study,…

arXiv cs.AI TIER_1 English(EN) · Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei, Xiangyuan Wang, Mengzhe Ruan, Hanxu Hou, Peisong Wang, Linqi Song, Shuang Qiu · 2026-06-10 04:00

ReasonAlloc: Layer-Wise Decoding-Time KV Cache Budget Allocation for Reasoning Models

arXiv:2606.11164v1 Announce Type: new Abstract: Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token ev…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 17:44

ReasonAlloc: Layer-Wise Decoding-Time KV Cache Budget Allocation for Reasoning Models

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget di…

arXiv cs.AI TIER_1 English(EN) · Shuang Qiu · 2026-06-09 17:44

ReasonAlloc: Layer-Wise Decoding-Time KV Cache Budget Allocation for Reasoning Models

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget di…

arXiv cs.AI TIER_1 English(EN) · Priyansh Bhatnagar, Ashkan Moradifirouzabadi, Se-Hyun Yang, SeungJae Lee, Jungwook Choi, Mingu Kang · 2026-06-09 04:00

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

arXiv:2606.08382v1 Announce Type: cross Abstract: Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive co…

arXiv cs.LG TIER_1 English(EN) · Yuji Yamamoto, Satoshi Matsuura · 2026-06-09 04:00

Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems

arXiv:2604.17249v2 Announce Type: replace-cross Abstract: Rowhammer on GPU DRAM has enabled adversarial bit flips in model weights; shared KV-cache blocks in LLM serving systems present an analogous but previously unexamined target. In vLLM's Prefix Caching, these blocks exist as…

arXiv cs.LG TIER_1 English(EN) · Yang Pengju · 2026-06-09 04:00

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

arXiv:2606.08635v1 Announce Type: new Abstract: Prefill-decode (PD) disaggregation decouples prompt processing from token generation, but it also turns the key-value (KV) cache into a network payload. Existing PD-side KV reduction methods are mostly binary: selected tokens are tr…

arXiv cs.LG TIER_1 English(EN) · Charles O'Neill, Alex Sandomirsky, Harry Partridge, Mudith Jayasekara, Max Kirkby · 2026-06-09 04:00

Still: Amortized KV Cache Compression in a Single Forward Pass

arXiv:2606.07878v1 Announce Type: new Abstract: The KV cache is the memory bottleneck of long-horizon language model deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under constraint, and re…

arXiv cs.CL TIER_1 English(EN) · Songhao Wu, Ang Lv, Xiao Feng, Yufei Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan · 2026-06-08 04:00

PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration

arXiv:2502.00527v2 Announce Type: replace-cross Abstract: The KV cache in large language models is a dominant factor in memory usage, limiting their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational costs; however, previou…

arXiv cs.AI TIER_1 English(EN) · Mingu Kang · 2026-06-07 00:24

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We pr…

arXiv cs.CL TIER_1 English(EN) · Chunsheng Zuo, Liaoyaqi Wang, William Jurayj, William Fleshman, Benjamin Van Durme · 2026-06-05 04:00

Rethinking LoRA Memory Through the Lens of KV Cache Compression

arXiv:2606.05698v1 Announce Type: new Abstract: Parametric retrieval augmentation encodes document information into lightweight, document-specific modules such as LoRA adapters, reducing the need to include all evidence as input context. However, it remains unclear how this param…

arXiv cs.LG TIER_1 English(EN) · Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi · 2026-06-05 04:00

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-Turn LLM Serving

arXiv:2606.06302v1 Announce Type: new Abstract: Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effect…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 15:41

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-Turn LLM Serving

Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering …

arXiv cs.LG TIER_1 English(EN) · Jungwook Choi · 2026-06-04 15:41

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-Turn LLM Serving

Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering …

arXiv cs.CL TIER_1 English(EN) · Momchil Hardalov, Gonzalo Iglesias, Adrià de Gispert · 2026-06-04 04:00

Caches at Scale: Training Modular KV Caches on Large Document Collections

arXiv:2606.04557v1 Announce Type: new Abstract: Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-va…

arXiv cs.CL TIER_1 English(EN) · Adrià de Gispert · 2026-06-03 07:42

Caches at Scale: Training Modular KV Caches on Large Document Collections

Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while …

arXiv cs.CL TIER_1 English(EN) · Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia · 2026-06-03 04:00

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

arXiv:2606.03928v1 Announce Type: cross Abstract: Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache…

arXiv cs.LG TIER_1 English(EN) · Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lukas Cavigelli · 2026-06-03 04:00

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

arXiv:2606.03458v1 Announce Type: new Abstract: Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but c…

arXiv cs.CL TIER_1 English(EN) · Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao, Bin Cui · 2026-06-03 04:00

Multi-Segment Attention: Efficient KV Cache Management for Faster LLM Serving

arXiv:2606.02964v1 Announce Type: cross Abstract: Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches …

arXiv cs.CL TIER_1 English(EN) · Robin Jia · 2026-06-02 17:16

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selecti…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 17:16

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

Value-aware stochastic KV cache eviction method improves reasoning model accuracy under compression by protecting large-magnitude states and promoting diverse eviction decisions.

arXiv cs.LG TIER_1 English(EN) · Lukas Cavigelli · 2026-06-02 10:34

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like …

arXiv cs.AI TIER_1 English(EN) · Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang, Siteng Huang, Linfeng Zhang · 2026-06-02 04:00

STaR-KV: Spatio-Temporal Adaptive Re-Weighting for KV Cache Compression in GUI Vision-Language Models

arXiv:2606.01790v1 Announce Type: cross Abstract: Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS…

arXiv cs.AI TIER_1 English(EN) · Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu, Xiaojie Li, Jungang Lou, Zechao Li, Jing Liu · 2026-06-02 04:00

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

arXiv:2606.00724v1 Announce Type: cross Abstract: Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in lo…

arXiv cs.LG TIER_1 English(EN) · Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim · 2026-06-02 04:00

LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

arXiv:2602.01053v2 Announce Type: replace Abstract: Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only by lightweight adapters. Despite sharing base model weights, each agent independently buil…

arXiv cs.LG TIER_1 English(EN) · Yu Li, Binxu Li, Tian Lan · 2026-06-02 04:00

MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference

arXiv:2606.01563v1 Announce Type: new Abstract: Autoregressive decoding in Transformer-based language models relies on the KV cache, whose memory footprint grows linearly with sequence length and becomes the primary bottleneck for long-context inference. KV cache eviction address…

arXiv cs.CL TIER_1 English(EN) · Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao · 2026-06-02 04:00

ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

arXiv:2602.03203v2 Announce Type: replace Abstract: Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory …

arXiv cs.AI TIER_1 English(EN) · Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen · 2026-06-02 04:00

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

arXiv:2602.08585v2 Announce Type: replace-cross Abstract: Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score mag…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

KVarN is a calibration-free KV-cache quantizer that uses Hadamard rotation and dual-scaling variance normalization to reduce error accumulation during autoregressive decoding in large language models.
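To illustrate why a Hadamard rotation is a common first step before low-bit quantization, here is a small sketch. It is not KVarN's algorithm: the dual-scaling variance normalization is omitted because this summary does not specify it. The toy vector, the 4-bit setting, and the helper names `hadamard_rotate` and `quant_step` are assumptions made for illustration.

```python
import math


def hadamard_rotate(x):
    """Orthonormal Walsh-Hadamard transform of a list whose length is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # with this scaling the transform is its own inverse
    return [v * scale for v in y]


def quant_step(x, bits=4):
    """Step size of a symmetric uniform quantizer using one scale for the whole vector."""
    levels = 2 ** (bits - 1) - 1
    return max(abs(v) for v in x) / levels


# Hypothetical toy vector: small values plus a single outlier channel.
vec = [0.1, -0.2, 0.15, 8.0, -0.05, 0.3, -0.1, 0.2]
rot = hadamard_rotate(vec)

print("step before rotation:", quant_step(vec))
print("step after rotation: ", quant_step(rot))
```

The rotation spreads the outlier's energy across all coordinates, so the largest magnitude and therefore the quantizer step shrink, while the vector's norm is preserved and the rotation can be undone exactly. Whether this lowers end-to-end error depends on the data and on the rest of the quantizer.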

arXiv cs.LG TIER_1 English(EN) · Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello · 2026-06-01 04:00

Don't Be So Stief! Learning KV Cache Low-Rank Approximation on the Stiefel Manifold

arXiv:2601.21686v2 Announce Type: replace Abstract: Key-value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by p…

arXiv cs.CL TIER_1 English(EN) · Vinayshekhar Bannihatti Kumar, Manoj Ghuhan Arivazhagan, Disha Makhija, Rashmi Gangadharaiah · 2026-06-01 04:00

Probing the Prompt KV Cache: When Is It Dispensable?

arXiv:2605.30574v1 Announce Type: new Abstract: Prior KV cache compression schemes empirically demonstrate that the prompt cache is partially redundant during decoding, dropping or summarising entries with little accuracy loss. We ask when and what kind of redundancy: at which la…

arXiv cs.CL TIER_1 English(EN) · Junjie Peng, You Wu, Haoyi Wu, Jialong Han, Xiaohua Xie, Kewei Tu, Jianhuang Lai · 2026-06-01 04:00

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

arXiv:2605.31105v1 Announce Type: new Abstract: Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compressio…

arXiv cs.CL TIER_1 English(EN) · Yanlin Qi, Xinhang Chen, Huiqiang Jiang, Qitong Wang, Botao Peng, Themis Palpanas · 2026-06-01 04:00

ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs

arXiv:2602.07721v3 Announce Type: replace-cross Abstract: KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework…

arXiv cs.CL TIER_1 English(EN) · Jianhuang Lai · 2026-05-29 10:16

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through ev…

arXiv cs.CL TIER_1 English(EN) · Yuan Feng, Junlin Lv, Haoyu Guo, Yukun Cao, S Kevin Zhou, Xike Xie · 2026-05-29 04:00

CriticalKV: Optimizing KV Cache Eviction from an Output-Perturbation Perspective

arXiv:2502.03805v2 Announce Type: replace Abstract: Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large KV…

arXiv cs.AI TIER_1 English(EN) · Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh · 2026-05-29 04:00

Moment-KV: Momentum-Based Decoding-Time KV Cache Compression for Long-Form Generation

arXiv:2605.29873v1 Announce Type: new Abstract: Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill c…

arXiv cs.AI TIER_1 English(EN) · Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag · 2026-05-29 04:00

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

arXiv:2605.30351v1 Announce Type: cross Abstract: Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-h…

arXiv cs.AI TIER_1 English(EN) · Pinar Yanardag · 2026-05-28 17:59

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to st…

arXiv cs.CL TIER_1 English(EN) · Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Hung-Yueh Chiang, Yash Akhauri, Xilai Dai, Huiqiang Jiang, Yucheng Li, Luis Ceze, Kai-Chiang Wu, Mohamed S. Abdelfattah · 2026-05-28 04:00

xKV: Cross-Layer KV Cache Compression via Aligned Singular Vector Extraction

arXiv:2503.18893v2 Announce Type: replace Abstract: Long-context Large Language Models (LLMs) enable powerful applications but incur high memory costs due to the key-value states (KV-Cache). Recent studies attempt to share KV-Cache across layers, but these approaches either requi…

arXiv cs.CL TIER_1 English(EN) · Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang · 2026-05-28 04:00

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

arXiv:2510.08525v3 Announce Type: replace Abstract: Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Exi…

arXiv cs.AI TIER_1 English(EN) · Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, David Cox, Antonio Torralba · 2026-05-28 04:00

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

arXiv:2605.27646v1 Announce Type: cross Abstract: We propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quanti…

arXiv cs.CL TIER_1 English(EN) · Hong Chen, Xiang Liu, Yubo Gao, Yuxuan Fan, Bo Wang, Yuanlin Chu, Yuanguo Lin, Xuming Hu · 2026-05-27 04:00

NestedKV: Nested Memory Routing for Long-Context KV Cache Compression

arXiv:2605.26678v1 Announce Type: new Abstract: Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal -- attention, recency, layer-wise allocation, o…

arXiv cs.LG TIER_1 English(EN) · Zejia Qi · 2026-05-27 04:00

LearnedCache: A Perceptron-Based Linux Page Cache Eviction Policy with eBPF Integration

arXiv:2605.26168v1 Announce Type: cross Abstract: Linux is the foundation of the digital age, accounting for the majority of the cloud and mobile OS markets. Any device that runs Linux uses the Linux page cache, a central pillar in OS and application performance, serving to reduc…

arXiv cs.AI TIER_1 English(EN) · Tuna Tuncer, Felix Becker, Thomas Pfeil · 2026-05-27 04:00

Quantized Keys Steal Attention: Bias Correction for KV Cache Compression in Video Diffusion Models

arXiv:2605.26266v1 Announce Type: cross Abstract: Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the…

arXiv cs.AI TIER_1 English(EN) · Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo · 2026-05-26 04:00

IndexMem: Learned KV Cache Eviction with Latent Memory for Long-Context LLM Inference

arXiv:2605.25475v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context infe…

arXiv cs.AI TIER_1 English(EN) · Yubo Li, Yidi Miao · 2026-05-26 04:00

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLMs

arXiv:2605.24786v1 Announce Type: cross Abstract: Long-horizon LLM inference turns the key--value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attentio…

arXiv cs.AI TIER_1 English(EN) · Wei Luo, Yi Huang, Songchen Ma, Huanyu Qu, Jiang Cai, Mingkun Xu · 2026-05-26 04:00

Meta-Soft: Context-Preserving KV Cache Compression with Composable Meta-Tokens

arXiv:2605.22337v2 Announce Type: replace Abstract: The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important rese…

arXiv cs.CL TIER_1 English(EN) · Yike Guo · 2026-05-25 06:29

IndexMem: Latent-Memory-Based KV Cache Eviction for Long-Context LLM Inference

Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context inference. A practical remedy is to evict less importa…

arXiv cs.LG TIER_1 English(EN) · Yuping Lin, Jiayuan Ding, Yue Xing, Pengfei He, Jiliang Tang, Subhabrata Mukherjee · 2026-05-25 04:00

A Simple Plug-In for Improving Eviction-Based KV Cache Compression

arXiv:2605.23258v1 Announce Type: new Abstract: KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical …

arXiv cs.AI TIER_1 English(EN) · Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso · 2026-05-25 04:00

ObjectCache: Layered Object-Storage Retrieval for KV Cache Reuse

arXiv:2605.22850v1 Announce Type: cross Abstract: Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache …
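The prefix-caching mechanism this abstract refers to can be sketched with chained block hashes: a request's tokens are split into fixed-size blocks, and each block's hash also covers everything before it, so a hash hit means the whole prefix matches. This is a toy illustration and is not taken from ObjectCache or any specific serving system. `BLOCK_SIZE`, `PrefixCache`, and the placeholder block contents are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block; hypothetical value


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block of tokens, chaining in the previous block's hash.

    Chaining means a block's hash identifies the entire prefix up to that block.
    """
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Toy stand-in for a KV block store, keyed by chained block hash."""

    def __init__(self):
        self.blocks = {}

    def insert(self, tokens):
        for h in block_hashes(tokens):
            self.blocks.setdefault(h, "kv-block")  # placeholder for real K/V tensors

    def cached_prefix_len(self, tokens):
        """Number of leading tokens whose KV blocks are already stored."""
        hits = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break
            hits += 1
        return hits * BLOCK_SIZE


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
cache.insert(system_prompt + [10, 11, 12, 13])

# A new request sharing the system prompt only needs prefill for its own suffix.
reused = cache.cached_prefix_len(system_prompt + [20, 21, 22, 23])
print("tokens served from cache:", reused)
```

Only the tokens after the reused prefix need a prefill pass, which is where the time-to-first-token saving comes from. The stored blocks accumulate with every distinct prefix, which is the storage pressure the abstract goes on to raise.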

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-24 00:00

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLMs

CONF-KV is a KV-cache management system that dynamically adjusts cache retention based on model uncertainty, improving memory efficiency and performance for long-sequence language model inference.

arXiv cs.CL TIER_1 English(EN) · Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · 2026-05-22 04:00

InnerQ: Hardware-Aware Tuning-Free Quantization of the KV Cache for Large Language Models

arXiv:2602.23200v2 Announce Type: replace-cross Abstract: When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decodin…

arXiv cs.AI TIER_1 English(EN) · Mark Boss, Vikram Voleti, Simon Donné, Shimon Vainer · 2026-05-22 04:00

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under Optimal Squared-Error Quantization

arXiv:2605.21226v1 Announce Type: cross Abstract: The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-co…

arXiv cs.LG TIER_1 English(EN) · Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu · 2026-05-22 04:00

CacheClip: Accelerating RAG with Effective KV Cache Reuse

arXiv:2510.10129v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 11:24

Meta-Soft: Context-Preserving KV Cache Compression with Composable Meta-Tokens

The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important research direction; however, existing methods based on fi…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-20 14:19

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under Optimal Squared-Error Quantization

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…

arXiv cs.AI TIER_1 English(EN) · Shimon Vainer · 2026-05-20 14:19

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under Optimal Squared-Error Quantization

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytical…

arXiv cs.CL TIER_1 English(EN) · Ngai Wong · 2026-05-19 10:53

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic chan…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-13 00:00

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

KVServe is a service-aware and adaptive framework for optimizing key-value communication compression in disaggregated large language model serving, achieving significant improvements in job completion time and time-to-first-token reduction through dynamic optimization.

arXiv cs.CV TIER_1 English(EN) · Weijia Dou, Hui Li, Jiahao Cui, Lei Zhou, Jingdong Wang, Siyu Zhu · 2026-06-01 04:00

SlotMemory: Object-Centric KV Memory for Streaming Long Video Generation

arXiv:2605.31033v1 Announce Type: new Abstract: Streaming video generation models typically rely on temporal-centric memory, which organizes historical context as raw frames, chunk segments, or unclustered tokens. This organization frequently leads to identity drift and semantic …

arXiv cs.CV TIER_1 English(EN) · Jiayi Luo, Qiyan Liu, Tengyang Wang, JunHao Liu, Jiayu Chen, Cong Wang, Hanxin Zhu, Chen Gao, Xiaobin Hu, Qingyun Sun, Zhibo Chen · 2026-05-29 04:00

Future Forcing: A Future-Aware Training-Free KV Cache Policy for Autoregressive Video Generation

arXiv:2605.30083v1 Announce Type: new Abstract: Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to av…

arXiv cs.CV TIER_1 English(EN) · Zhibo Chen · 2026-05-28 15:30

Future Forcing: A Future-Aware Training-Free KV Cache Policy for Autoregressive Video Generation

Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to avoid redundant recomputation across generation st…

Together AI blog TIER_1 English(EN) · 2026-03-04 00:00

Cache-Aware Prefill-Decode Disaggregation (CPD) for Up to 40% Faster Long-Context LLM Serving

Serving long prompts doesn't have to mean slow responses. Learn how Together AI's CPD architecture separates warm and cold inference workloads to deliver 40% higher throughput and dramatically lower time-to-first-token for long-context LLM serving.

MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-05-25 21:24

Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Inference

Together AI has released OSCAR (Offline Spectral Covariance-Aware Rotation), an INT2 KV cache quantization method for long-context LLM serving. Unlike prior rotation-based approaches that apply data-oblivious Hadamard transforms, OSCAR derives separate rotations for keys and v…

Towards AI TIER_1 English(EN) · Sumit Vedpathak · 2026-05-25 22:01

The Silent Speedup: How KV Cache Makes AI Feel Instant

Link: https://pub.towardsai.net/the-silent-speedup-how-kv-cache-makes-ai-feel-instant-273031a9e6bc?source=rss----98111c9905da---4

Towards AI TIER_1 English(EN) · Armin Norouzi, Ph.D · 2026-05-19 22:01

KV Cache Internals: How Transformers Avoid Recomputing Attention

Link: https://pub.towardsai.net/kv-cache-internals-how-transformers-avoid-recomputing-attention-27672f3382e0?source=rss----98111c9905da---4

r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-06-09 19:00

OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Link: https://www.reddit.com/r/LocalLLaMA/comments/1u1edjb/oscar_rotationzoo_offline_spectral/

r/LocalLLaMA TIER_1 English(EN) · /u/Rikers88 · 2026-06-08 11:59

[Benchmark] DFlash Speculative Decoding + KV Cache Compression Achieves 3.26x Speedup on RTX 5090

Hardware: RTX 5090 | Model: Qwen3.6-27B | Framework: BeeLlama.cpp. Full benchmark scripts, raw data, config, and generated artifacts are available on request; just DM or comment below.…

r/LocalLLaMA TIER_1 (CA) · /u/Anbeeld · 2026-06-07 11:54

Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ

Link: https://www.reddit.com/r/LocalLLaMA/comments/1tza4ji/qwen_36_27b_kv_cache_quant_benchmarks_75_pairs/

r/LocalLLaMA TIER_1 English(EN) · /u/AccountAntique9327 · 2026-06-07 03:38

GraphKV, KV cache optimization based on graph embedding models

I've been working on a project inspired by TurboQuant. It isn't perfect, but it's pretty good for a project I started today; please check it out: GraphKV (https://github.com/heterodoxin/graphkv) …

r/LocalLLaMA TIER_1 English(EN) · /u/Anbeeld · 2026-06-06 18:06

KV cache quant benchmarks: KVarN at 6 bits matches q8_0 precision, and at 4 bits matches q5_0. Impressive!

TL;DR: Based on long-context KLD benchmarks, KVarN appears to be just better than the usual llama.cpp KV cache quants. At every size, KVarN matches the precision of usual quants one bit higher.…

dev.to — LLM tag TIER_1 English(EN) · Tech_Nuggets · 2026-06-06 01:10

KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

You just deployed a 70B Llama fine-tune on 8x H100s, and your serving box happily handles 200 concurrent 8k contexts. Then product says "can you do 32k?" and suddenly the math stops …

r/LocalLLaMA TIER_1 English(EN) · /u/wadeAlexC · 2026-06-04 18:52

Dynamic KV cache quantization and on-demand mmproj/MTP loading: my llama.cpp wishlist

We all know the struggle of optimizing your VRAM usage: quantized model, quantized kvcache, mmproj off. I'm often frustrated by the tradeoffs I have to make in these areas. On my RTX 5090, I can fit: Qwen3.5-27B @ Q6_K, …

r/LocalLLaMA TIER_1 English(EN) · /u/acluk90 · 2026-06-04 14:47

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, single vLLM flag)

Link: https://www.reddit.com/r/LocalLLaMA/comments/1twptw2/kvarn_new_kvcache_quant_from_huawei_35_kv_cache/

r/MachineLearning TIER_1 English(EN) · /u/intentionallyBlue · 2026-06-04 13:21

KVarN: Variance-Normalized KV-Cache Quantization [R]

Excited to share some of my own work here :) KVarN is our new KV-Cache quantization method. Very briefly, we combine Hadamard rotations with variance normalization on both axes of the K and V matrices, then roun…

r/LocalLLaMA TIER_1 English(EN) · /u/Thrumpwart · 2026-05-26 04:04

Shard - 10x KV cache compression

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a…

Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-05-25 22:53

Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantization system for long-context LLM inference. The method…

Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantisation system for long-context LLM serving. The method derives separate rotations for keys and values from attention-aware covariance structures, reducing the BF16 accuracy gap to just 3.78 points while d…

Link: marktechpost.com/…/together-ai-open-sourc…

r/LocalLLaMA TIER_1 English(EN) · /u/pmttyji · 2026-05-25 11:52

OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Link: https://www.reddit.com/r/LocalLLaMA/comments/1tn6v0r/oscar_rotationzoo_offline_spectral/

r/LocalLLaMA TIER_1 English(EN) · /u/ayylmaonade · 2026-05-25 02:51

llama.cpp has a neat trick to speed up KV cache decoding

So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one under deve…

Coverage sources [81]
