Local LLM optimization: Step-3.7-Flash gains 2.4x speed, MTP breaks vision

By PulseAugur Editorial · [1 source] · 2026-06-28 13:02

A developer has tuned Step-3.7-Flash (198B-A11B vision MoE) to run on local hardware and reports a large speed gain. An IQ3_XXS quantization that fits entirely within the 96 GB of combined VRAM on four 3090 GPUs ran 2.4x faster than a larger IQ4_XS quantization that spilled part of the model to the CPU. The developer also found that the model's speculative decoding feature (MTP) is incompatible with its vision capabilities, causing hard aborts when processing image tokens.
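The fit-or-spill outcome described above comes down to simple arithmetic, which the following minimal sketch illustrates. It is not the developer's method. The bits-per-weight figures are assumed nominal values for llama.cpp-style quantization types, the overhead allowance is a hypothetical placeholder for KV cache and runtime buffers, and real model files mix tensor types, so actual sizes will differ.

```python
# Rough estimate of whether a quantized model's weights fit in combined VRAM.
# All constants below are illustrative assumptions, not measurements from the post.

GB = 1e9  # decimal gigabytes, matching the nominal "24 GB" per 3090

# Assumed nominal bits per weight for each quantization type.
NOMINAL_BPW = {"IQ3_XXS": 3.06, "IQ4_XS": 4.25}

TOTAL_PARAMS = 198e9   # 198B total parameters (from the model name)
VRAM_GB = 4 * 24       # four 3090 GPUs at 24 GB each = 96 GB
OVERHEAD_GB = 8.0      # hypothetical allowance for KV cache and buffers


def estimate_weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters * bits / 8 bits per byte."""
    return total_params * bits_per_weight / 8 / GB


def fits_in_vram(weight_gb: float, vram_gb: float, overhead_gb: float) -> bool:
    """True if weights plus the overhead allowance stay within VRAM."""
    return weight_gb + overhead_gb <= vram_gb


# Map each quantization name to (estimated size in GB, fits entirely in VRAM?).
report = {}
for name, bpw in NOMINAL_BPW.items():
    size_gb = estimate_weight_gb(TOTAL_PARAMS, bpw)
    report[name] = (size_gb, fits_in_vram(size_gb, VRAM_GB, OVERHEAD_GB))

for name, (size_gb, fits) in report.items():
    status = "fully resident" if fits else "spills to CPU"
    print(f"{name}: ~{size_gb:.0f} GB -> {status}")
```

Under these assumptions the IQ3_XXS estimate lands below 96 GB and the IQ4_XS estimate lands above it, which is consistent with the resident-versus-spilled behavior the post reports.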

IMPACT Demonstrates how VRAM capacity significantly impacts local LLM performance, influencing hardware choices and model quantization strategies.

RANK_REASON Developer's optimization of an existing open-source model for local hardware.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

r/LocalLLaMA TIER_1 English(EN) · /u/Important_Quote_1180 · 2026-06-28 13:02

Step-3.7-Flash (198B-A11B vision MoE) on 4×3090 — fully-resident IQ3_XXS beats the spilled IQ4 by 2.4×, and MTP speculative decode silently breaks vision

https://www.reddit.com/r/LocalLLaMA/comments/1uhwra3/step37flash_198ba11b_vision_moe_on_43090/
