Step-3.7-Flash (198B-A11B vision MoE) on 4×3090 — fully-resident IQ3_XXS beats the spilled IQ4 by 2.4×, and MTP speculative decode silently breaks vision

Local LLM optimization: Step-3.7-Flash runs 2.4× faster, MTP breaks vision

By PulseAugur Editorial · [1 source] · 2026-06-28 13:02

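A developer has tuned Step-3.7-Flash (198B-A11B vision MoE) for local hardware and reports a substantial performance gain. With the largest quantization that fits (IQ3_XXS) kept fully resident in the 96 GB of VRAM across four 3090 GPUs, they measured a 2.4× speedup over a higher-precision quantization (IQ4_XS) that spills to the CPU. The developer also found that the model's speculative decoding feature (MTP) is incompatible with its vision capability: it causes a hard abort when image tokens are processed.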
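The residency argument in the summary above comes down to simple arithmetic, sketched here as a rough estimate rather than the poster's own method. The bits-per-weight figures are the nominal values llama.cpp lists for these quantization types and are an assumption here: real GGUF files mix tensor types, so actual file sizes differ. The sketch also assumes 24 GiB per 3090, and `reserve_gib` is a hypothetical allowance for KV cache, the vision projector, and runtime overhead.

```python
GIB = 1024 ** 3

# Assumptions, not measurements from the post:
N_PARAMS = 198e9                 # total parameters, from the "198B" in the model name
VRAM_GIB = 4 * 24                # four RTX 3090 cards at 24 GiB each
NOMINAL_BPW = {                  # nominal bits per weight for each quant type
    "IQ3_XXS": 3.0625,
    "IQ4_XS": 4.25,
}


def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def fits_in_vram(quant: str, reserve_gib: float = 0.0) -> bool:
    """True if the weights plus a reserved allowance fit in total VRAM."""
    needed = weight_gib(N_PARAMS, NOMINAL_BPW[quant]) + reserve_gib
    return needed <= VRAM_GIB


if __name__ == "__main__":
    for quant, bpw in NOMINAL_BPW.items():
        size = weight_gib(N_PARAMS, bpw)
        verdict = "fits" if fits_in_vram(quant, reserve_gib=8.0) else "spills"
        print(f"{quant}: ~{size:.1f} GiB of weights, {verdict} in {VRAM_GIB} GiB")
```

Under these assumptions the IQ3_XXS weights come to roughly 71 GiB, leaving headroom on 96 GiB, while the IQ4_XS weights come to roughly 98 GiB and exceed total VRAM before any cache is allocated, which is consistent with the spill to CPU described in the post.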

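Impact: Shows how strongly VRAM capacity can affect local LLM performance, which in turn shapes hardware choices and model quantization strategy.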

Ranking rationale: A developer optimized an existing open-source model for local hardware.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA TIER_1 English(EN) · /u/Important_Quote_1180 · 2026-06-28 13:02

Step-3.7-Flash (198B-A11B vision MoE) on 4×3090 — fully-resident IQ3_XXS beats the spilled IQ4 by 2.4×, and MTP speculative decode silently breaks vision

https://www.reddit.com/r/LocalLLaMA/comments/1uhwra3/step37flash_198ba11b_vision_moe_on_43090/
