(CA) llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp

llama.cpp PR reduces VRAM use by limiting context outputs

By PulseAugur Editorial · [1 source] · 2026-06-01 15:29

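A pull request to the llama.cpp project aims to reduce VRAM usage by limiting the maximum number of outputs of a `llama_context`. The change builds on an earlier PR and reserves space for logits only when it is needed, which could save a substantial amount of memory. The developer proposes exposing an API in llama-context to manage this reservation: it would default to all tokens, while allowing a specific setting for the server context.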
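The potential saving can be sketched with simple arithmetic. A minimal illustration, assuming the logits buffer holds one row of `n_vocab` 32-bit floats per output token; the function name `logits_buffer_bytes` and every number below (vocabulary size, batch size, output limit) are hypothetical values chosen for the example, not figures or identifiers taken from the PR.

```python
def logits_buffer_bytes(n_outputs: int, n_vocab: int, bytes_per_logit: int = 4) -> int:
    """Bytes needed to hold one row of n_vocab logits for each output token.

    Assumption: logits are stored as 32-bit floats (4 bytes each).
    """
    if n_outputs < 0 or n_vocab <= 0 or bytes_per_logit <= 0:
        raise ValueError("sizes must be positive (n_outputs may be zero)")
    return n_outputs * n_vocab * bytes_per_logit


MIB = 1024 * 1024

# Hypothetical model and batch configuration, for illustration only.
n_vocab = 128_000   # assumed vocabulary size
n_batch = 2048      # assumed tokens per batch
max_outputs = 4     # assumed limit, e.g. one output row per parallel sequence

# Reserving logits space for every token in the batch (the default described above).
all_tokens = logits_buffer_bytes(n_batch, n_vocab)

# Reserving logits space only for the outputs that are actually needed.
limited = logits_buffer_bytes(max_outputs, n_vocab)

print(f"all tokens: {all_tokens / MIB:.1f} MiB")
print(f"limited:    {limited / MIB:.1f} MiB")
print(f"reduction:  {all_tokens // limited}x")
```

Under these assumed numbers the reservation scales linearly with the number of outputs, so a caller that only samples from the last token of each sequence needs a small fraction of the space reserved for a full batch. The actual savings depend on the model's vocabulary size and the batch settings in use.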

Impact: By lowering VRAM requirements, this optimization may allow larger models to run on consumer-grade hardware.

Ranking rationale: This is a pull request to an open-source project, a tool-level update.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA TIER_1 (CA) · /u/pmttyji · 2026-06-01 15:29

llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp

https://www.reddit.com/r/LocalLLaMA/comments/1ttvpmt/llama_limit_max_outputs_of_llama_context_by/

