Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context.

llama.cpp user seeks advice on long context and KV cache optimization

By the PulseAugur editorial team · [1 source] · 2026-05-28 07:21

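A user on the r/LocalLLaMA subreddit is asking for advice on tuning llama.cpp for long contexts and efficient KV cache quantization. They currently run a modified llama.cpp build with MTP (multi-token prediction) and a Q4 KV cache, which reaches about 60 tokens per second at shorter contexts but slows down significantly as the context grows. The user is looking for alternative approaches or configurations that perform better with extended context windows.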
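To make the KV cache trade-off concrete, here is a rough sizing sketch. It assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128) that does not come from the post, and it counts only the K and V tensors for one sequence, ignoring compute buffers and anything MTP adds. The per-element sizes follow the ggml block layouts for `q8_0` and `q4_0` (32 values plus one fp16 scale per block).

```python
# Approximate bytes per cached element for common llama.cpp cache types.
# q8_0: 32 int8 values + one fp16 scale = 34 bytes per 32 elements.
# q4_0: 32 4-bit values + one fp16 scale = 18 bytes per 32 elements.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}

# Hypothetical model shape, chosen only for illustration.
EXAMPLE_MODEL = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate the size of the K and V caches for a single sequence."""
    # Factor 2 covers the K tensor and the V tensor.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


def kv_cache_gib(n_ctx, cache_type, model=EXAMPLE_MODEL):
    """Same estimate, expressed in GiB for readability."""
    return kv_cache_bytes(n_ctx, cache_type=cache_type, **model) / 2**30


if __name__ == "__main__":
    for n_ctx in (8192, 32768, 131072):
        sizes = ", ".join(
            f"{name}: {kv_cache_gib(n_ctx, name):.3f} GiB"
            for name in BYTES_PER_ELEMENT
        )
        print(f"n_ctx={n_ctx:>6}  {sizes}")
```

Under these assumptions the cache grows linearly with context length, and a `q4_0` cache takes a little over a quarter of the `f16` footprint. In llama.cpp the cache types are chosen with `--cache-type-k` and `--cache-type-v`. Whether quantization helps or hurts throughput at long context depends on the build and hardware, so the numbers above describe memory only.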

Impact: Users are exploring ways to improve local LLM inference performance at longer contexts.

Ranking rationale: A forum question about optimizing existing software, not a new release or major event.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA · /u/GodComplecs · 2026-05-28 07:21

Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context.

Used the vllm version of https://github.com/noonghunna/club-3090

It worked fine for maybe 20 to 40k context, haven't tried the new one. Anyone used the new llama.cpp patched one for single …
