PulseAugur

StepFun Step-3.7-Flash generates tokens 27.5% faster with Multi-Token Prediction

A user on r/LocalLLaMA benchmarked StepFun Step-3.7-Flash, a large language model with roughly 200 billion total parameters, on an AMD Ryzen AI Max+ 395 APU (Strix Halo). The run used a patched llama.cpp build with Vulkan/RADV support and a context size of 12,288 tokens. With Multi-Token Prediction (MTP) enabled, token generation reached 26.0 tokens/s, 27.5% faster than the non-MTP baseline, while prefill speed was largely unchanged. The MTP run also drew less power than the baseline.
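The summary gives the MTP generation rate and the relative gain but not the baseline rate itself. A minimal sketch of the arithmetic follows, assuming the 27.5% figure means (MTP rate - baseline rate) / baseline rate on the same setup. The function names and the 1,000-token reply length are hypothetical, chosen only for illustration, and the derived baseline is an estimate rather than a measured value from the post.

```python
def implied_baseline(rate_with_mtp: float, speedup_pct: float) -> float:
    """Back out the non-MTP rate from the MTP rate and the relative gain.

    Assumes speedup_pct = (rate_with_mtp - baseline) / baseline * 100.
    """
    return rate_with_mtp / (1.0 + speedup_pct / 100.0)


def seconds_for(tokens: int, tokens_per_s: float) -> float:
    """Time to generate `tokens` at a steady rate, ignoring prefill."""
    return tokens / tokens_per_s


mtp_rate = 26.0       # tokens/s, from the summary
speedup_pct = 27.5    # percent, from the summary
reply_tokens = 1000   # hypothetical reply length, not from the post

baseline_rate = implied_baseline(mtp_rate, speedup_pct)
saved = seconds_for(reply_tokens, baseline_rate) - seconds_for(reply_tokens, mtp_rate)

print(f"implied baseline: {baseline_rate:.2f} tokens/s")
print(f"time saved on {reply_tokens} tokens: {saved:.1f} s")
```

Under that reading, the baseline works out to roughly 20.4 tokens/s, so a 1,000-token reply would finish about 10.6 seconds sooner with MTP. Prefill time is left out because the summary reports it as largely unchanged.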

IMPACT Demonstrates improved inference speed for large local models, potentially enabling more responsive AI applications on consumer hardware.

RANK_REASON User benchmark of a specific model version and its performance characteristics.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/westsunset

    StepFun 3.7 Flash MTP Bench Strix Halo

    This is the StepFun Step-3.7-Flash UD-IQ4_XS main model with the official StepFun MTP Q8_0 draft model, served through a patched llama.cpp Vulkan/RADV build.

    Host

    System: AMD Ryzen AI Max+ 395 / Rad…