Why is KV cache quantization precision increased before weight quantization precision?

LLM users debate KV cache precision versus weight quantization when memory is limited

By the PulseAugur editorial team · [1 source] · 2026-06-02 22:11

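Users on the r/LocalLLaMA subreddit are discussing large language model optimization, specifically why, when memory is limited, KV cache precision is sometimes raised before weight precision. The practice of keeping the KV cache at 8-bit while dropping the weights to 4-bit has been observed, but the community lacks a clear explanation for it.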

Impact: not applicable

Ranking rationale: user discussion of technical optimization strategies for large language models.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA TIER_1 English(EN) · /u/Civil_Fee_7862 · 2026-06-02 22:11

Why is KV cache quantization increased before weight quantization?

In cases where RAM is limited, I've seen a preference for increasing KV cache precision instead of weight precision.

I.e., 8-bit KV cache but only 4-bit weights.

But I can't seem to find a solid explanation as to why.
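The memory side of this trade-off can be made concrete with some rough arithmetic. The sketch below is not from the post: the model dimensions (8 billion parameters, 32 layers, 8 KV heads, head dimension 128, 32,768-token context) are hypothetical, and the formulas ignore the per-block scale overhead that real quantization formats add. It only shows what each choice costs in bytes, and says nothing about the effect on output quality.

```python
def weight_bytes(n_params: int, bits: int) -> float:
    """Approximate size of the weights, ignoring quantization scale overhead."""
    return n_params * bits / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bits: int, batch: int = 1) -> float:
    """Approximate KV cache size: one K and one V tensor per layer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len * batch
    return elements * bits / 8


# Hypothetical model, chosen only for illustration.
N_PARAMS = 8_000_000_000
LAYERS, KV_HEADS, HEAD_DIM, CONTEXT = 32, 8, 128, 32_768

GIB = 2 ** 30
for w_bits, kv_bits in [(4, 4), (4, 8), (8, 4)]:
    w = weight_bytes(N_PARAMS, w_bits)
    kv = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, kv_bits)
    print(f"weights {w_bits}-bit + KV {kv_bits}-bit: "
          f"{w / GIB:.2f} GiB + {kv / GIB:.2f} GiB = {(w + kv) / GIB:.2f} GiB")
```

Under these assumed dimensions, moving the KV cache from 4-bit to 8-bit costs about 1 GiB, while moving the weights from 4-bit to 8-bit costs 4 GB. The balance shifts with context length, since the KV cache grows linearly with the number of cached tokens and the weights do not.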