Stochastic KV Routing enables adaptive depth-wise cache sharing for LLMs

By the PulseAugur editorial team · [1 source] · 2026-04-28 04:00

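Researchers have developed a method called Stochastic KV Routing to reduce the memory footprint of Transformer language models. The technique trains layers to randomly attend to the KV states of preceding layers, which enables adaptive depth-wise cache sharing. Evaluations indicate that the method can substantially lower memory requirements without sacrificing performance, and that it can even act as a regularizer when training data is limited.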
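To make the idea of layers attending to the KV states of earlier layers more concrete, here is a minimal NumPy sketch. It is a toy illustration, not the authors' implementation: the function names (`choose_kv_sources`, `forward`), the `share_prob` parameter, and the uniform sampling rule are all assumptions introduced for this example, and the paper may route KVs differently.

```python
import numpy as np

def attention(q, k, v):
    """Single-head causal scaled dot-product attention on (seq, dim) arrays."""
    seq, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    # Mask future positions so each token only attends to itself and the past.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def choose_kv_sources(num_layers, share_prob, rng):
    """Pick, per layer, which layer's KV it reads (hypothetical routing rule).

    With probability `share_prob`, a layer reuses the KV of a uniformly
    sampled earlier layer; otherwise it uses its own KV.
    """
    sources = []
    for layer in range(num_layers):
        if layer > 0 and rng.random() < share_prob:
            sources.append(int(rng.integers(0, layer)))
        else:
            sources.append(layer)
    return sources

def forward(x, weights, sources):
    """Run a stack of attention-only layers using the sampled KV routing."""
    needed = set(sources)  # only these layers have to materialize a KV cache
    kv_cache = {}
    h = x
    for layer, (wq, wk, wv) in enumerate(weights):
        if layer in needed:
            kv_cache[layer] = (h @ wk, h @ wv)
        k, v = kv_cache[sources[layer]]
        h = h + attention(h @ wq, k, v)  # residual connection
    return h

rng = np.random.default_rng(0)
num_layers, dim, seq = 4, 8, 5
weights = [tuple(rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
           for _ in range(num_layers)]
sources = choose_kv_sources(num_layers, share_prob=0.5, rng=rng)
out = forward(rng.normal(size=(seq, dim)), weights, sources)
print("KV source per layer:", sources)
print("KV caches stored:", len(set(sources)), "of", num_layers)
```

In this sketch, resampling `sources` at every training step would expose each layer to both its own and earlier layers' KVs. At serving time a fixed routing could then be chosen, so that only the layers appearing in `sources` need to keep a cache.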

Impact: Reduces the KV cache memory footprint, which could lower the cost of serving Transformer models.

Ranking rationale: Academic paper proposing a novel method for optimizing Transformer model inference.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →


Sources [1]

arXiv cs.LG TIER_1 English(EN) · Anastasiia Filippova, David Grangier, Marco Cuturi, João Monteiro · 2026-04-28 04:00

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

arXiv:2604.22782v1 Announce Type: new Abstract: Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts servin…

