HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
By PulseAugur Editorial · [11 sources]
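Researchers are developing several new methods to optimize the key-value (KV) cache in large language models, a major bottleneck for long-context processing. The approaches include training models to natively produce compressible representations (KV-CAT), manipulating the latent attention space for efficient steering (Memory Inception), and applying advanced quantization techniques such as int4 and spectral denoising (eOptShrinkQ, HeadQ). In addition, new strategies such as WindowQuant for multimodal models and tierKV for distributed KV-cache management aim to reduce latency and memory usage; tierKV is even reported to restore evicted blocks faster than a GPU cache hit.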
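As a reference point for the int4 quantization mentioned above, the following is a minimal sketch of symmetric per-group abs-max quantization in plain Python. The function names, the group size of 4, and the symmetric code range [-7, 7] are illustrative assumptions; none of them is taken from the papers listed here, each of which adds its own transforms and corrections on top of a basic scheme like this.

```python
def quantize_int4(values, group_size=4):
    """Symmetric per-group abs-max quantization to integer codes in [-7, 7].

    Illustrative only: group_size and the code range are assumptions.
    """
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        # One scale per group; an all-zero group gets a dummy scale of 1.0.
        scale = amax / 7.0 if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-7, min(7, round(v / scale))) for v in group)
    return codes, scales


def dequantize_int4(codes, scales, group_size=4):
    """Map integer codes back to floats using the per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


values = [0.1, -2.0, 0.5, 1.5, 0.0, 0.0, 0.0, 0.0, 3.0, -0.2, 0.7, -1.1]
codes, scales = quantize_int4(values)
recon = dequantize_int4(codes, scales)
```

Each reconstructed value differs from the original by at most half of its group's scale, which is why a single large outlier in a group degrades the precision of every other entry in that group.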
AI
arXiv:2605.05699v1 Announce Type: cross Abstract: KV-cache quantization is framed as a quality-latency trade-off. We show it is inverted on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT + per-channel λ + per-group abs-max + …
arXiv:2605.05971v1 Announce Type: new Abstract: Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression me…
arXiv cs.LG
Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker, John Sous
arXiv:2605.06225v1 Announce Type: new Abstract: Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activat…
arXiv:2605.04075v1 Announce Type: new Abstract: Multimodal Large Language Models face severe challenges in computational efficiency and memory consumption due to the substantial expansion of the visual KV cache when processing long visual contexts. Existing KV cache compression m…
arXiv:2605.02905v1 Announce Type: new Abstract: We show that the key-value (KV) cache in transformer attention heads admits a natural decomposition into a low-rank shared context component and a full-rank per-token residual, well described by the spiked random matri…
arXiv cs.LG
Jorge L. Ruiz Williams
arXiv:2605.03562v1 Announce Type: new Abstract: KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visible coordinates. For keys, the visible object is …
arXiv cs.AI
Jorge L. Ruiz Williams
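The distinction between storage-space error and what attention actually reads can be illustrated with a toy example. The sketch below is not the paper's metric (the abstract is truncated before it defines the key-side quantity); it only shows, under the standard scaled dot-product logit q·k/√d, that two key perturbations with the same L2 reconstruction error can change the logit by very different amounts. All vectors and helper names are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_error(a, b):
    """Storage-space error: Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def logit_error(query, key, key_hat):
    """Change in the scaled attention logit q.k / sqrt(d) when key becomes key_hat."""
    d = len(query)
    return abs(dot(query, key) - dot(query, key_hat)) / math.sqrt(d)


query = [1.0, 0.0, 0.0, 0.0]
key = [0.5, 0.5, 0.5, 0.5]
key_a = [0.6, 0.5, 0.5, 0.5]  # perturbed along the query direction
key_b = [0.5, 0.6, 0.5, 0.5]  # perturbed orthogonally to the query

storage_a = l2_error(key, key_a)
storage_b = l2_error(key, key_b)
logit_a = logit_error(query, key, key_a)
logit_b = logit_error(query, key, key_b)

print("storage error:", storage_a, storage_b)
print("logit error:  ", logit_a, logit_b)
```

Both perturbed keys are equally far from the original in storage space, yet only the one aligned with the query shifts the logit. A quantizer that minimizes reconstruction error alone cannot tell these two cases apart.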
arXiv:2605.02262v1 Announce Type: new Abstract: Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerant inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) ca…
10 experiments, 3 models, one honest verdict: the quality story is real, the speed story needs a disclaimer, and there's a finding in the entropy data nobody talks about. (~14 min read · Deep Dive · LLM Inference · Quantization · Serving) …
dev.to — LLM tag
prasanna kanagasabai
The Problem: When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs full prefill from scratch, which is quadratic in sequence length. On a 30,000-token document that's 10+ seconds, every single time t…
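The alternative to evict-and-discard is to demote evicted blocks to a slower tier and restore them on the next hit. The sketch below is a toy two-tier LRU cache in plain Python that shows the idea. It is not tierKV's implementation; the class name, the `compute_block` callback, and the use of dictionaries to stand in for GPU memory and a slower store are all assumptions made for illustration.

```python
from collections import OrderedDict


class TwoTierBlockCache:
    """Toy LRU cache that demotes evicted blocks instead of discarding them."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for GPU memory
        self.slow = {}             # stands in for host RAM, disk, or a remote node
        self.recomputes = 0        # how many times the expensive path was taken

    def get(self, block_id, compute_block):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)  # mark as most recently used
            return self.fast[block_id]
        if block_id in self.slow:
            block = self.slow.pop(block_id)  # restore instead of re-running prefill
        else:
            block = compute_block(block_id)  # cold miss: pay the prefill cost
            self.recomputes += 1
        self._insert_fast(block_id, block)
        return block

    def _insert_fast(self, block_id, block):
        self.fast[block_id] = block
        while len(self.fast) > self.fast_capacity:
            evicted_id, evicted = self.fast.popitem(last=False)  # least recently used
            self.slow[evicted_id] = evicted  # demote rather than discard


cache = TwoTierBlockCache(fast_capacity=2)
compute = lambda block_id: f"kv-{block_id}"  # placeholder for a real prefill
for block_id in ["a", "b", "c"]:
    cache.get(block_id, compute)
restored = cache.get("a", compute)  # "a" was evicted, but comes back from the slow tier
```

In this run the fourth lookup does not increment `recomputes`, because block "a" is restored from the slow tier rather than recomputed. Whether that restore is cheaper than a fresh prefill depends on transfer bandwidth and block size, which the toy model does not capture.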