FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

FibQuant method delivers substantial KV-cache compression for LLMs

By the PulseAugur editorial team · [2 sources] · 2026-05-12 03:45

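Researchers have developed FibQuant, a vector quantization method designed to substantially compress the key-value (KV) cache used in large language models (LLMs). By replacing scalar quantization with a more efficient vector-based approach, the technique aims to reduce the memory traffic associated with long-context inference. Experiments show that FibQuant achieves high compression ratios while maintaining high fidelity, for example 34x compression on the GPT-2 small KV cache, and reports better perplexity (lower is better) than existing methods on models such as TinyLlama-1.1B.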
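To make the scalar-versus-vector distinction concrete, here is a minimal Python sketch of the two generic ideas. It is not FibQuant's codebook or algorithm, which the summary does not describe. All function names, the toy codebook, and the 16-bit baseline are assumptions chosen for illustration.

```python
import math

def scalar_quantize(x, bits, lo=-1.0, hi=1.0):
    """Uniform scalar quantizer: every coordinate is rounded on its own."""
    step = (hi - lo) / (2 ** bits - 1)
    return [round((min(max(v, lo), hi) - lo) / step) for v in x]

def vq_encode(x, codebook, block):
    """Vector quantizer: each group of `block` coordinates is replaced by the
    index of its nearest codeword. Assumes len(x) is a multiple of `block`."""
    indices = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        nearest = min(
            range(len(codebook)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(chunk, codebook[i])),
        )
        indices.append(nearest)
    return indices

def vq_decode_block(indices, codebook, k):
    """Random access: block k is recovered by one table lookup,
    without decoding any other block."""
    return list(codebook[indices[k]])

def bits_per_coordinate(codebook_size, block):
    """Storage rate of a fixed-rate vector quantizer."""
    return math.log2(codebook_size) / block

def compression_ratio(baseline_bits, codebook_size, block):
    """Ratio against an assumed baseline of `baseline_bits` per coordinate."""
    return baseline_bits / bits_per_coordinate(codebook_size, block)

# Toy codebook with 2 codewords of dimension 2 (hypothetical values).
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_indices = vq_encode([0.9, 1.1, 0.1, -0.1], toy_codebook, block=2)
```

In this toy setup a 16-entry codebook over blocks of 4 coordinates costs 1 bit per coordinate, so a vector quantizer can go below the 1-bit-per-coordinate floor of a scalar code simply by using larger blocks or smaller codebooks. Because every block has the same index width, any block can be located and decoded independently.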

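Impact: By reducing KV-cache memory requirements, FibQuant enables more efficient long-context inference, which could lower operating costs and make models more accessible.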

Ranking rationale: An academic paper was published detailing a new technique for optimizing LLM inference.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv stat.ML · Namyoon Lee, Yongjune Kim · 2026-05-13 04:00

FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

arXiv:2605.11478v1 Announce Type: cross Abstract: Long-context inference is increasingly a memory-traffic problem. The culprit is the key-value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar c…

arXiv stat.ML · Yongjune Kim · 2026-05-12 03:45

FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

Long-context inference is increasingly a memory-traffic problem. The culprit is the key-value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar codecs meet this systems constraint by storing a no…
