PulseAugur

LLM users debate KV cache precision over weight quantization for limited RAM

Users on the r/LocalLLaMA subreddit are discussing a memory trade-off in running large language models locally: when RAM is limited, some setups keep the key-value (KV) cache at higher precision than the model weights, for example an 8-bit KV cache alongside 4-bit weights. The original poster has observed this preference but could not find a solid explanation for it.
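To make the trade-off concrete, the following minimal sketch estimates how much memory the weights and the KV cache each occupy at a given bit width. The model shape (parameter count, layer count, KV heads, head dimension, context length) is a hypothetical example and is not taken from the discussion; the arithmetic also ignores the per-block scale overhead that real quantization formats add, so treat the numbers as rough lower bounds.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def weight_bytes(n_params: int, bits: float) -> float:
    """Approximate weight storage: one value of `bits` bits per parameter."""
    return n_params * bits / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bits: float) -> float:
    """Approximate KV cache storage for one sequence.

    The factor 2 accounts for storing both keys and values in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bits / 8


# Hypothetical model shape, chosen only for illustration.
N_PARAMS = 8_000_000_000
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
CONTEXT_LEN = 8192


def footprint_gib(weight_bits: float, kv_bits: float) -> tuple[float, float]:
    """Return (weights GiB, KV cache GiB) for the hypothetical model above."""
    w = weight_bytes(N_PARAMS, weight_bits)
    kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN, kv_bits)
    return w / GIB, kv / GIB


if __name__ == "__main__":
    # Compare a few (weight bits, KV cache bits) combinations.
    for w_bits, kv_bits in [(4, 8), (8, 4), (4, 16)]:
        w_gib, kv_gib = footprint_gib(w_bits, kv_bits)
        print(f"weights {w_bits}-bit: {w_gib:.2f} GiB | "
              f"KV cache {kv_bits}-bit: {kv_gib:.2f} GiB | "
              f"total: {w_gib + kv_gib:.2f} GiB")
```

Under these assumed dimensions the weights take up most of the memory, so a bit saved on the weights frees more RAM than a bit saved on the cache; the balance shifts as the context length grows, since the cache scales linearly with it. This only covers the memory side of the question, not how quality responds to each kind of quantization.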

IMPACT N/A

RANK_REASON User discussion on a technical optimization strategy for LLMs.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA · TIER_1 · English (EN) · /u/Civil_Fee_7862

    Why are KV cache quants increased before weight quants?

    In cases where RAM is limited, I've seen a preference for increasing KV cache precision instead of the weight precision.

    I.e. 8-bit KV cache but only 4-bit weights.

    But I can't seem to find a solid explanation as to why.