RoPE-Aware Bit Allocation for KV-Cache Quantization
New RoPE-Aware KV-Cache Quantization Improves LLM Performance
By PulseAugur Editorial · [8 sources]
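A new research paper introduces Block-GTQ, a method for optimizing KV-cache quantization in large language models. The technique explicitly accounts for RoPE (Rotary Position Embedding) when allocating bits, giving more precision to the most sensitive frequency blocks. Block-GTQ shows significant improvements in preserving model fidelity and downstream task performance, especially on long-context retrieval and reasoning, where it outperforms uniform quantization. The paper also describes a packed-cache serving path that substantially reduces memory usage and improves speed, enabling longer context windows.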
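The summary contrasts Block-GTQ with uniform quantization and mentions a packed-cache serving path. As a rough illustration of what "packed" low-bit storage means, the sketch below quantizes a key vector to b-bit integer codes with plain min-max uniform quantization and packs several codes into each byte. The function names are hypothetical, and this is not Block-GTQ's serving kernel; the sources do not describe its memory layout.

```python
import math


def quantize_uniform(values, bits):
    """Min-max uniform quantization of a list of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp guards against float round-off pushing a code just past the last level.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


def pack_codes(codes, bits):
    """Pack b-bit codes into bytes; assumes bits divides 8 (1, 2, 4 or 8)."""
    if 8 % bits:
        raise ValueError("bits must divide 8")
    per_byte = 8 // bits
    out = bytearray(math.ceil(len(codes) / per_byte))
    for i, c in enumerate(codes):
        # Lower-indexed codes occupy the lower bits of each byte.
        out[i // per_byte] |= c << ((i % per_byte) * bits)
    return out


def unpack_codes(packed, bits, n):
    """Recover the first n b-bit codes from a packed bytearray."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    return [(packed[i // per_byte] >> ((i % per_byte) * bits)) & mask for i in range(n)]


key = [0.1 * i - 0.8 for i in range(16)]
codes, scale, lo = quantize_uniform(key, bits=4)
packed = pack_codes(codes, bits=4)
print(len(packed), "bytes packed vs", 2 * len(key), "bytes at fp16")  # 8 bytes packed vs 32 bytes at fp16
```

With 4-bit codes the 16 values occupy 8 bytes instead of 32 at fp16. A real cache also has to store each group's `scale` and `lo`, which adds some overhead on top of the packed codes.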
AI
arXiv:2606.26875v1 · Announce Type: cross
Reasoning capability has advanced rapidly in large language models (LLMs), leading to an increasing size of key-value (KV) cache in both prefilling and decoding stages. Existing KV cache compression methods mainly rely on attention weights to estimate token importance. While atte…
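The abstract notes that existing KV-cache compression methods mostly estimate token importance from attention weights. A minimal sketch of that baseline, in the style of accumulated-attention ("heavy hitter") eviction, follows: each cached token's score is the total softmax attention it receives from a set of queries, and only the top-k tokens are kept. The helper names and the single-head, unmasked setup are illustrative assumptions; this is not InfoKV's method.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def accumulated_attention(queries, keys):
    """Sum the attention weight each cached key receives across all queries (single head, no mask)."""
    d = len(keys[0])
    scores = [0.0] * len(keys)
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        for j, w in enumerate(softmax(logits)):
            scores[j] += w
    return scores


def keep_top_k(scores, k):
    """Indices of the k highest-scoring tokens, returned in positional order."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
queries = [[2.0, 0.0], [1.0, 0.5]]
scores = accumulated_attention(queries, keys)
print(keep_top_k(scores, 2))  # [0, 1]
```

Because each query's weights sum to 1, the scores always total the number of queries; eviction simply drops the tokens with the smallest share.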
InfoKV is an entropy-aware KV cache compression framework that enhances long-context reasoning in LLMs by incorporating information-theoretic signals alongside attention weights.
arXiv:2606.24033v1 · Announce Type: cross
Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise …
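The claim that a key's contribution to a future logit decomposes into a position-dependent sum over two-dimensional frequency blocks can be checked numerically. Treat each dimension pair (2i, 2i+1) as a complex number with frequency w_i = base^(-2i/d); after RoPE, the q·k logit equals the sum over blocks of |q_i||k_i| cos((m - n) w_i + phase(q_i) - phase(k_i)), which depends only on the offset m - n. The sketch assumes the interleaved pairing convention and base 10000 from the original RoPE formulation (some implementations pair dimension i with i + d/2 instead); function names are illustrative.

```python
import math


def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE to an even-length vector at position `pos`, rotating pairs (2i, 2i+1)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def block_terms(q, k, m, n, base=10000.0):
    """Per-block logit terms |q_i||k_i| cos((m - n) w_i + phase_q - phase_k) for query pos m, key pos n."""
    d = len(q)
    terms = []
    for i in range(d // 2):
        w = base ** (-2 * i / d)
        rq, pq = math.hypot(q[2 * i], q[2 * i + 1]), math.atan2(q[2 * i + 1], q[2 * i])
        rk, pk = math.hypot(k[2 * i], k[2 * i + 1]), math.atan2(k[2 * i + 1], k[2 * i])
        terms.append(rq * rk * math.cos((m - n) * w + pq - pk))
    return terms


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.8, 0.5, -0.4, 0.9, 1.1, -0.7]
k = [1.0, 0.2, -0.6, 0.4, 0.7, -0.3, 0.1, 0.9]
logit = dot(rope_rotate(q, 7), rope_rotate(k, 3))
print(logit, sum(block_terms(q, k, 7, 3)))  # the two values agree up to float round-off
```

Each term's sensitivity to quantization error in k_i depends on the block's frequency and on the offsets at which it will be queried, which is why a flat-vector quantizer can misallocate precision.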
Block-GTQ introduces a RoPE-aware bit allocation method for key-cache quantization that improves attention accuracy and downstream performance through adaptive bit distribution and packed cache serving.
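Block-GTQ's exact allocation rule is not given in these excerpts. As a hedged illustration of block-wise bit allocation, the sketch below uses a textbook rate-distortion assumption: a block quantized with b bits contributes distortion proportional to w_i · 4^(-b), where w_i is a per-block sensitivity weight (for RoPE keys one might use, for example, the expected energy of that frequency block's logit term). Bits are handed out greedily to whichever block gains the most from one more bit; since each block's distortion is convex in b, this greedy rule is optimal under that model. The weights, the `b_min`/`b_max` bounds, and the function name are all assumptions.

```python
import heapq


def allocate_bits(weights, total_bits, b_min=1, b_max=8):
    """Greedily assign integer bit widths per block to minimize sum_i w_i * 4**(-b_i)
    subject to sum_i b_i <= total_bits and b_min <= b_i <= b_max."""
    bits = [b_min] * len(weights)
    budget = total_bits - b_min * len(weights)
    if budget < 0:
        raise ValueError("total_bits is below b_min bits per block")
    # Max-heap (via negated keys) on the distortion drop from one extra bit:
    # w * 4**(-b) - w * 4**(-(b + 1)) = 0.75 * w * 4**(-b).
    heap = [(-0.75 * w * 4.0 ** (-b_min), i) for i, w in enumerate(weights) if b_min < b_max]
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        budget -= 1
        if bits[i] < b_max:
            heapq.heappush(heap, (-0.75 * weights[i] * 4.0 ** (-bits[i]), i))
    return bits


# Hypothetical sensitivities: one dominant frequency block and three weak ones.
print(allocate_bits([20.0, 1.0, 1.0, 1.0], total_bits=8))  # [4, 2, 1, 1]
```

With the same 8-bit budget a uniform split would give every block 2 bits; the greedy rule instead moves precision toward the dominant block, which is the qualitative behavior the Block-GTQ summary describes.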
Reddit, r/LocalLLaMA: "I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT" (https://www.reddit.com/r/LocalLLaMA/comments/1udjvhd/i_mapped_the_kld_of_kv_cache_quantization_for/)
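The Reddit post measures KV-cache quantization damage with KL divergence (KLD). Its exact protocol is not shown here; a common setup compares, position by position, the next-token distribution from a full-precision cache (P) against the one produced with a quantized cache (Q) and averages KL(P || Q). A minimal sketch of that metric, with hypothetical helper names and toy logits:

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats between the softmax distributions of two logit vectors."""
    lp, lq = log_softmax(ref_logits), log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def mean_kld(ref_rows, test_rows):
    """Average per-position KLD over a sequence of (reference, quantized) logit rows."""
    return sum(kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)) / len(ref_rows)


# Toy logits: reference (full-precision cache) vs. a slightly perturbed "quantized" run.
ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
quant = [[1.9, 0.6, -1.0], [0.1, 0.4, 2.8]]
print(mean_kld(ref, quant))
```

KLD is zero only when the two distributions match, and it is unaffected by adding a constant to all logits, so it isolates the distortion introduced by the quantized cache.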