Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

InfiniteKV lets LLMs access context far beyond their training limits

By PulseAugur Editorial · [1 source] · 2026-06-12 06:34

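InfiniteKV is a new KV cache system designed to extend the context window of large language models by storing old tokens in a compressed, searchable format in RAM or on disk. This lets a model reach information well beyond its original training limit: Mistral-7B successfully answered a query from token 76,747, well past its 32,768-token window. The system keeps recent tokens in GPU memory for speed and offloads older ones to RAM or disk, cutting the memory needed per million tokens from gigabytes to megabytes.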
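To put the memory figures in perspective, here is a rough sizing sketch. It is not InfiniteKV code. It assumes the publicly documented Llama-3.1-8B attention shape (32 layers, 8 KV heads, head dimension 128), fp16 storage, and that the quoted 104-byte record is the total stored per token, which the truncated source post does not confirm. All variable names are illustrative.

```python
# Back-of-the-envelope KV cache sizing.
# Assumptions: Llama-3.1-8B shape from its public config, fp16 values,
# and "104 bytes" being the whole per-token record (not confirmed by the source).

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

LLAMA_31_8B = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)  # fp16 assumed
RECORD_BYTES = 104        # per-token record size quoted in the post
TRAINED_WINDOW = 32_768   # Mistral-7B window cited in the summary
ANSWER_POSITION = 76_747  # token position cited in the summary

per_token = kv_bytes_per_token(**LLAMA_31_8B)
full_cache_1m = per_token * 1_000_000   # conventional cache, one million tokens
records_1m = RECORD_BYTES * 1_000_000   # filed records, one million tokens
window_ratio = ANSWER_POSITION / TRAINED_WINDOW

print(f"KV cache per token:        {per_token / 2**20:.3f} MiB")
print(f"Full cache, 1M tokens:     {full_cache_1m / 2**30:.1f} GiB")
print(f"104-byte records, 1M:      {records_1m / 2**20:.1f} MiB")
print(f"Answer position / window:  {window_ratio:.2f}x")
```

Under these assumptions the conventional cache works out to 131,072 bytes (0.125 MiB) per token, in line with the post's "about 0.12 MB", and the answer position is roughly 2.3 times the trained window, matching the headline.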

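Impact: Lets LLMs process and recall information from greatly extended contexts, potentially unlocking new applications in long-form content analysis and generation.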

Ranking rationale: This is a new technical approach to extending LLM context windows, released as an open-source project with verifiable results.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA TIER_1 · /u/Final-Data-1410 · 2026-06-12 06:34

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

What it is, in plain words. Your GPU keeps two float vectors for every token of your conversation. That’s the KV cache, and it’s why long contexts eat VRAM: Llama-3.1-8B needs about 0.12 MB per token, so 100k tokens costs 12 GB and a million toke…

