PulseAugur
English(EN) RoPE-Aware Bit Allocation for KV-Cache Quantization

New methods improve LLM efficiency through KV-cache compression and quantization

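Researchers have developed new methods to make large language models (LLMs) more efficient by compressing their key-value (KV) cache. One method, InfoKV, combines information-theoretic signals such as predictive uncertainty with attention weights to better estimate token importance for compression, and shows improved performance on long-context reasoning tasks with models such as Llama-3.1 and DeepSeek-R1. Another method, Block-GTQ, focuses on RoPE-aware bit allocation for KV-cache quantization: it adjusts the bit distribution according to how sensitive the different frequency blocks in RoPE are to quantization error. The technique is reported to substantially improve downstream performance on tasks such as long-context retrieval and reasoning, and to achieve large KV-cache compression with minimal quality loss, as demonstrated on models such as Llama-3.1-8B-Instruct and Qwen2.5-3B-Instruct.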
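To make the RoPE-aware idea concrete, here is a minimal NumPy sketch. It is not the Block-GTQ algorithm, whose details are not given in this digest. It only illustrates two things: that an attention logit under RoPE splits into a sum of per-block terms (one per two-dimensional frequency block), and how a generic greedy allocator could spread a bit budget across blocks. The adjacent-pair RoPE layout, the `sensitivity` proxy, and the `4**(-bits)` distortion model are all assumptions chosen for illustration.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE to an even-length vector, pairing dims (2i, 2i+1).
    Assumption: adjacent-pair layout; some implementations pair halves instead."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2-D block
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def blockwise_logit_terms(q, k, q_pos, k_pos):
    """Per-block contributions to the logit <RoPE(q, q_pos), RoPE(k, k_pos)>."""
    qr, kr = rope_rotate(q, q_pos), rope_rotate(k, k_pos)
    return qr[0::2] * kr[0::2] + qr[1::2] * kr[1::2]

def greedy_bit_allocation(sensitivity, total_bits, min_bits=1, max_bits=8):
    """Generic greedy allocator (hypothetical helper, not from the paper).
    Assumes block error scales as sensitivity * 4**(-bits), so each extra
    bit removes 75% of a block's remaining modeled error."""
    sensitivity = np.asarray(sensitivity, dtype=float)
    bits = np.full(len(sensitivity), min_bits, dtype=int)
    for _ in range(total_bits - int(bits.sum())):
        gain = sensitivity * 4.0 ** (-bits) * 0.75
        gain[bits >= max_bits] = -np.inf
        if not np.isfinite(gain).any():
            break  # every block is already at max_bits
        bits[np.argmax(gain)] += 1
    return bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k = rng.normal(size=16), rng.normal(size=16)
    terms = blockwise_logit_terms(q, k, q_pos=40, k_pos=7)
    print("logit:", terms.sum())
    # Hypothetical sensitivity proxy: product of per-block query and key energy.
    energy = (q[0::2] ** 2 + q[1::2] ** 2) * (k[0::2] ** 2 + k[1::2] ** 2)
    print("bits per block:", greedy_bit_allocation(energy, total_bits=32))
```

The block-wise view matters because each term depends only on the relative position and on one frequency, so quantization error in different blocks can affect future logits differently. A real method would estimate sensitivity from calibration data rather than from a single query-key pair.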

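Impact: These advances in KV-cache compression and quantization are expected to significantly reduce LLM memory usage and improve inference speed, enabling longer context windows and more efficient deployment.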
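As a rough illustration of the memory at stake, the sketch below estimates KV-cache size from model shape and bits per stored value. The configuration used in the example (32 layers, 8 KV heads, head dimension 128, the published Llama-3.1-8B shape) is an assumption brought in for illustration and does not come from the summarized papers. The estimate also ignores quantization metadata such as scales and zero points, so real low-bit caches are somewhat larger.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV-cache size in bytes.

    The factor 2 accounts for storing both keys and values.
    Quantization metadata (scales, zero points) is not counted.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8

if __name__ == "__main__":
    # Assumed shape: 32 layers, 8 KV heads, head_dim 128, 128K-token context.
    for bits in (16, 8, 4, 2):
        gib = kv_cache_bytes(32, 8, 128, 131072, bits) / 2**30
        print(f"{bits:>2}-bit cache: {gib:.1f} GiB")
```

Under these assumptions a 16-bit cache at a 128K-token context takes 16 GiB for a single sequence, and a 4-bit cache takes about a quarter of that before metadata overhead.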

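Ranking rationale: Multiple research papers and community discussions detail novel methods for KV-cache compression and quantization in LLMs.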


AI-generated summary · Google Gemini · based on 8 sources.


Sources [8]

  1. arXiv cs.AI TIER_1 English(EN) · Jushi Kai, Zhuiri Xiao, Alexandra Birch, Zhouhan Lin ·

    Information-Aware KV Cache Compression for Long Reasoning

    arXiv:2606.26875v1 Announce Type: cross Abstract: Reasoning capability has advanced rapidly in large language models (LLMs), leading to an increasing size of key-value (KV) cache in both prefilling and decoding stages. Existing KV cache compression methods mainly rely on attentio…

  2. arXiv cs.AI TIER_1 English(EN) · Zhouhan Lin ·

    Information-Aware KV Cache Compression for Long Reasoning

    Reasoning capability has advanced rapidly in large language models (LLMs), leading to an increasing size of key-value (KV) cache in both prefilling and decoding stages. Existing KV cache compression methods mainly rely on attention weights to estimate token importance. While atte…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    Information-Aware KV Cache Compression for Long Reasoning

    InfoKV is an entropy-aware KV cache compression framework that enhances long-context reasoning in LLMs by incorporating information-theoretic signals alongside attention weights.

  4. arXiv cs.CL TIER_1 English(EN) · Fengfeng Liang, Yuechen Zhang, Jiaya Jia ·

    RoPE-Aware Bit Allocation for KV-Cache Quantization

    arXiv:2606.24033v1 Announce Type: cross Abstract: Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency block…

  5. arXiv cs.CL TIER_1 English(EN) · Jiaya Jia ·

    RoPE-Aware Bit Allocation for KV-Cache Quantization

    Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise …

  6. Hugging Face Daily Papers TIER_1 English(EN) ·

    RoPE-Aware Bit Allocation for KV-Cache Quantization

    Block-GTQ introduces a RoPE-aware bit allocation method for key-cache quantization that improves attention accuracy and downstream performance through adaptive bit distribution and packed cache serving.

  7. r/LocalLLaMA TIER_1 English(EN) · /u/crusaderky ·

    I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

    https://www.reddit.com/r/LocalLLaMA/comments/1udjvhd/i_mapped_the_kld_of_kv_cache_quantization_for/

  8. r/LocalLLaMA TIER_1 English(EN) · /u/rima_2711 ·

    Gemma 4 QAT seems to respond significantly better to KV cache quantization

    https://www.reddit.com/r/LocalLLaMA/comments/1ubl0df/gemma_4_qat_seems_to_respond_significantly_better/