New Research Targets KV Cache Compression to Improve LLM Efficiency

By the PulseAugur editorial team · [3 sources] · 2026-06-15 00:00

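Two new research papers propose novel methods for compressing the KV cache in large language models (LLMs) to improve inference efficiency. The first, PolyKV, introduces a layer-wise optimization framework that applies different compression policies and cache budgets to transformer layers according to each layer's role. The second, BACON, targets multimodal LLMs and calibrates attention so that key visual information is better preserved under aggressive compression.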
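To make the idea of non-uniform, per-layer cache budgets concrete, here is a minimal Python sketch. It is not PolyKV's algorithm, which the summary above does not detail. It assumes a hypothetical importance weight per layer, splits a total token budget in proportion to those weights, and then keeps the highest-scoring tokens in a layer. The names `allocate_budgets` and `keep_top_tokens`, the weights, and the scores are all illustrative.

```python
from typing import List


def allocate_budgets(layer_weights: List[float], total_budget: int,
                     min_per_layer: int = 1) -> List[int]:
    """Split a total token budget across layers in proportion to
    hypothetical per-layer importance weights (an assumption, not
    a quantity defined by either paper)."""
    n = len(layer_weights)
    weight_sum = sum(layer_weights)
    if weight_sum <= 0:
        raise ValueError("layer weights must sum to a positive value")
    if total_budget < n * min_per_layer:
        raise ValueError("total_budget is too small for the per-layer minimum")

    # Every layer gets the minimum; the rest is shared proportionally.
    spare = total_budget - n * min_per_layer
    budgets = [min_per_layer + int(spare * w / weight_sum) for w in layer_weights]

    # Hand out the tokens lost to rounding down, largest weights first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: layer_weights[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def keep_top_tokens(scores: List[float], budget: int) -> List[int]:
    """Return positions of the `budget` highest-scoring tokens,
    restored to their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    budgets = allocate_budgets([1.0, 3.0], total_budget=8)
    print(budgets)
    print(keep_top_tokens([0.1, 0.9, 0.3, 0.7], budget=budgets[0]))
```

A uniform policy would give both layers four tokens each; the weighted split instead shifts capacity toward the layer assumed to matter more while keeping the total fixed.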

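Impact: These methods aim to reduce memory cost and latency in LLM inference, which could enable longer context windows and more efficient deployment of multimodal models.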
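For a sense of the memory cost at stake, the standard back-of-the-envelope estimate for KV cache size can be sketched as follows. The model shape used here (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an illustrative assumption and is not taken from either paper; `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing one key tensor and
    one value tensor per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


if __name__ == "__main__":
    # Assumed, illustrative configuration with a 32,768-token context.
    full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=32768)
    # Retaining a quarter of the cached tokens shrinks the cache linearly.
    compressed = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                                seq_len=32768 // 4)
    print(full / 1024**3, "GiB uncompressed")
    print(compressed / 1024**3, "GiB at 25% retention")
```

Because the size grows linearly with sequence length and batch size, any method that retains fewer cached tokens reduces memory in direct proportion, which is why compression matters most for long contexts.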

Ranking rationale: Two arXiv papers propose novel methods for KV cache compression in LLMs.


AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

arXiv cs.AI TIER_1 English(EN) · Chao Fei, Panos Kalnis · 2026-06-16 04:00

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

arXiv:2606.15157v1 Announce Type: cross Abstract: KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all transfo…

arXiv cs.CL TIER_1 English(EN) · Tianhao Chen, Yuheng Wu, Kelu Yao, Xiaogang Xu, Xiaobin Hu, Dongman Lee · 2026-06-16 04:00

Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

arXiv:2606.14782v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation window attention for …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-15 00:00

Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Multi-turn large language model serving faces memory constraints due to growing key-value cache, but a structured approach to non-uniform compression enables significant throughput improvements through static budget allocation and optimized memory management.
