HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

By the PulseAugur editorial team · [11 sources] · 2026-05-04 06:17

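Researchers are developing several new methods to optimize the key-value (KV) cache in large language models, a major bottleneck for long-context processing. The approaches include training models to produce inherently compressible representations (KV-CAT), manipulating the latent attention space for efficient steering (Memory Inception), and applying advanced quantization techniques such as int4 and spectral denoising (eOptShrinkQ, HeadQ). In addition, new strategies such as WindowQuant for multimodal models and tierKV for distributed KV-cache management aim to reduce latency and memory use; tierKV even restores evicted blocks faster than a GPU cache hit.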

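Impact: The new KV-cache optimization techniques promise to significantly reduce LLM inference latency and memory use, enabling longer contexts and faster processing.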

Ranking rationale: Multiple research papers propose new techniques for KV-cache optimization in LLMs.


AI-generated summary · Google Gemini · from 11 sources.

Sources [11]

arXiv cs.AI TIER_1 English(EN) · Mohamed Amine Bergach · 2026-05-08 04:00

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon

arXiv:2605.05699v1 Announce Type: cross Abstract: KV-cache quantization is framed as a quality-latency trade-off. We show it is inverted on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT + per-channel λ + per-group abs-max +…
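
For readers unfamiliar with the last ingredient named in this abstract, per-group abs-max scaling can be sketched in a few lines of plain Python. This is a minimal illustration and not the paper's fused Metal kernel: the function names, the group size of 4, and the symmetric [-7, 7] code range are assumptions made for the example, and the sign-randomized FFT and per-channel λ steps are omitted.

```python
from typing import List, Tuple


def quantize_int4_groups(values: List[float], group_size: int = 4) -> Tuple[List[int], List[float]]:
    """Symmetric int4 quantization with one abs-max scale per group (illustrative)."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        abs_max = max(abs(v) for v in group)
        # Map the largest magnitude in the group to 7, the top of the
        # symmetric range [-7, 7]; an all-zero group gets a dummy scale of 1.
        scale = abs_max / 7.0 if abs_max > 0 else 1.0
        scales.append(scale)
        for v in group:
            q = int(round(v / scale))
            codes.append(max(-7, min(7, q)))  # clamp to the assumed int4 range
    return codes, scales


def dequantize_int4_groups(codes: List[int], scales: List[float], group_size: int = 4) -> List[float]:
    """Recover approximate values by multiplying each code by its group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Hypothetical input chosen so that every value is an exact multiple of its group scale.
values = [7.0, -3.0, 1.0, 0.0, 14.0, -6.0, 2.0, 0.0]
codes, scales = quantize_int4_groups(values)
restored = dequantize_int4_groups(codes, scales)
# codes    -> [7, -3, 1, 0, 7, -3, 1, 0]
# scales   -> [1.0, 2.0]
# restored -> equal to values for this particular input
```

In general the round trip is lossy: each restored value can differ from the original by up to half of its group's scale, which is why smaller groups (more scales) trade extra metadata for lower error.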

arXiv cs.LG TIER_1 English(EN) · Yoav Gelberg, Yam Eitan, Michael Bronstein, Yarin Gal, Haggai Maron · 2026-05-08 04:00

Training Transformers for KV Cache Compressibility

arXiv:2605.05971v1 Announce Type: new Abstract: Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression me…
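
The linear scaling mentioned in this abstract is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a standard layout with one key and one value tensor per layer; the function name and the model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical and not taken from the paper.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV-cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem is 2.0 for fp16 and 0.5 for int4 (ignoring scale metadata).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, used only for illustration.
per_token_fp16 = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
full_fp16 = kv_cache_bytes(32_768, num_layers=32, num_kv_heads=8, head_dim=128)
full_int4 = kv_cache_bytes(32_768, num_layers=32, num_kv_heads=8, head_dim=128,
                           bytes_per_elem=0.5)
# per_token_fp16 -> 131072 bytes (128 KiB per token)
# full_fp16      -> 4294967296 bytes (4 GiB at 32,768 tokens)
# full_int4      -> 1073741824 bytes (1 GiB, before scale metadata)
```

Doubling the prefix length doubles the result, which is the memory side of the bottleneck the abstract describes.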

arXiv cs.LG TIER_1 English(EN) · Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker, John Sous · 2026-05-08 04:00

Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

arXiv:2605.06225v1 Announce Type: new Abstract: Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activat…

arXiv cs.LG TIER_1 English(EN) · Sihao Liu, YuFan Xiong, Zhonghua Jiang, Zhaode Wang, chengfei lv, Shengyu Zhang · 2026-05-07 04:00

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

arXiv:2605.04075v1 Announce Type: new Abstract: Multimodal Large Language Models face severe challenges in computational efficiency and memory consumption due to the substantial expansion of the visual KV cache when processing long visual contexts. Existing KV cache compression m…

arXiv cs.LG TIER_1 English(EN) · Pei-Chun Su · 2026-05-06 04:00

eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

arXiv:2605.02905v1 Announce Type: new Abstract: We show that the key-value (KV) cache in transformer attention heads admits a natural decomposition into a low-rank "shared context" component and a full-rank "per-token" residual, well described by the spiked random matri…

arXiv cs.LG TIER_1 English(EN) · Jorge L. Ruiz Williams · 2026-05-06 04:00

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

arXiv:2605.03562v1 Announce Type: new Abstract: KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visib…

arXiv cs.AI TIER_1 English(EN) · Jorge L. Ruiz Williams · 2026-05-05 09:34

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visible coordinates. For keys, the visible object is …
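
The motivating observation in this abstract, that keys are read through logits rather than directly, can be illustrated with a toy example. The sketch below is not HeadQ's method; it only shows, under assumed toy vectors and hypothetical helper names, that two key perturbations with the same storage-space error can have different effects on the attention logit.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def storage_error(k: List[float], k_hat: List[float]) -> float:
    """Squared reconstruction error of the stored key (storage space)."""
    return sum((x - y) ** 2 for x, y in zip(k, k_hat))


def logit_error(q: List[float], k: List[float], k_hat: List[float]) -> float:
    """Absolute change in the scaled attention logit q.k / sqrt(d)."""
    return abs(dot(q, k) - dot(q, k_hat)) / math.sqrt(len(q))


# Toy vectors, chosen for illustration only.
q = [1.0, 0.0, 0.0, 0.0]
k = [0.5, 0.5, 0.5, 0.5]
k_aligned = [0.75, 0.5, 0.5, 0.5]     # perturbation along the query direction
k_orthogonal = [0.5, 0.75, 0.5, 0.5]  # same-size perturbation orthogonal to q

# Both perturbations have storage_error == 0.0625, but
# logit_error(q, k, k_aligned)    -> 0.125
# logit_error(q, k, k_orthogonal) -> 0.0
```

A quantizer that minimizes only storage-space error treats the two perturbations as equally bad, even though only the first one changes what attention computes for this query.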

arXiv cs.CV TIER_1 English(EN) · Wei Tao, Xiaoyang Qu, Peiqiang Wang, Guokuan Li, Jiguang Wan, Kai Lu, Jianzong Wang · 2026-05-05 04:00

WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

arXiv:2605.02262v1 Announce Type: new Abstract: Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerant inference latency and GPU memory usage. Existing methods propose mixed…

arXiv cs.CV TIER_1 English(EN) · Jianzong Wang · 2026-05-04 06:17

WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerant inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) ca…

Towards AI TIER_1 English(EN) · Ravi Yogesh · 2026-05-09 20:01

Is 3-Bit KV Cache the Holy Grail? A Reality Check on Google’s TurboQuant

10 experiments, 3 models, one honest verdict: the quality story is real, the speed story needs a disclaimer, and there’s a finding in the entropy data nobody talks about. …

dev.to — LLM tag TIER_1 English(EN) · prasanna kanagasabai · 2026-05-09 03:01

tierKV: A Distributed KV Cache That Makes Evicted Blocks Faster to Restore Than GPU Cache Hits

The Problem: When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs full prefill from scratch, quadratic in sequence length. On a 30,000-token document that's 10+ seconds, every single time t…
