
Gemma 4 MTP and QAT Speed Up Local LLMs

By the PulseAugur editorial team · [2 sources] · 2026-06-08 15:04

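The latest update to the "Run LLMs Locally" project adds MTP (Multi-Token Prediction) for Gemma models, which the author reports speeds up token generation by a further 25-90%. This comes on top of a gain of about 50% from QAT (quantization-aware training), so the two together substantially improve local LLM performance. In addition, a configuration adjustment reduced prompt sizes by 60% and enabled logging of all prompts.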
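To put the reported figures side by side, the following minimal sketch estimates the overall throughput multiplier. It assumes, which the source posts do not confirm, that the roughly 50% gain from QAT and the additional 25-90% gain from MTP compound multiplicatively over the same baseline; the function name `combined_speedup` is hypothetical.

```python
def combined_speedup(qat_gain: float, mtp_gain: float) -> float:
    """Overall throughput multiplier, ASSUMING the two gains compound.

    Gains are fractions: 0.50 means "50% faster".
    """
    return (1.0 + qat_gain) * (1.0 + mtp_gain)


# Figures quoted in the source post: 50% from QAT, then another 25-90% from MTP.
low = combined_speedup(0.50, 0.25)
high = combined_speedup(0.50, 0.90)

print(f"Estimated overall speedup: {low:.3f}x to {high:.2f}x")
```

Under that assumption the combined effect would land somewhere between just under 2x and just under 3x the original token rate; actual results depend on hardware, quantization, and workload.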

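Impact: These optimizations for local LLM execution can lower the barrier to entry for advanced AI applications, letting more users run powerful models on consumer-grade hardware.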

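Ranking rationale: This cluster discusses optimizations and performance improvements for running existing LLMs locally, which falls under AI research and development.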


AI-generated summary · Google Gemini · based on 2 sources.

Sources [2]

Mastodon — sigmoid.social TIER_1 English(EN) · [email protected] · 2026-06-10 18:28


Same week, small update: Run LLMs Locally Multi-Token-Prediction (MTP) for Gemma-4-E4B and Gemma-4-26B from Unsloth. After 50% from QAT, this brings another 25-90% improvement in token generation speed. The OpenCode config slide received a small update to reduce prompt sizes with…

Link: codeberg.org/…/Run_LLMs_Locally_2026_Thom…
r/LocalLLaMA TIER_1 English(EN) · /u/Ready_Performance_35 · 2026-06-08 15:04

Gemma 4 QAT + MTP: up to 33% faster token generation, any thoughts?

Hello, my setup is 2x RTX 3060 Ti 8GB. Without the assistant model (MTP) I get around 75 t/s; adding the assistant model as a draft, I manage to reach a peak of 100 t/s. I tried putting the model on a single card with minimal context si…
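The 33% in the post's title follows from the two throughput figures it quotes. A small sketch of that arithmetic, using only the 75 t/s and 100 t/s numbers from the post (the helper names are hypothetical):

```python
def relative_speedup(baseline_tps: float, new_tps: float) -> float:
    """Fractional throughput gain of new_tps over baseline_tps."""
    return (new_tps - baseline_tps) / baseline_tps


def ms_per_token(tps: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tps


baseline, with_draft = 75.0, 100.0  # t/s without and with the MTP draft model

print(f"Speedup: {relative_speedup(baseline, with_draft):.1%}")
print(f"Per-token latency: {ms_per_token(baseline):.1f} ms -> "
      f"{ms_per_token(with_draft):.1f} ms")
```

Note that the 100 t/s figure is described as a peak, so the sustained gain on this setup may be lower than the one-third improvement computed here.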
