A user benchmarked the StepFun Step-3.7-Flash model, a large language model with approximately 200 billion total parameters, on an AMD Ryzen AI Max+ 395 APU. The benchmark used a patched llama.cpp build with Vulkan/RADV support and a context size of 12,288 tokens. With Multi-Token Prediction (MTP) enabled, token generation speed rose by 27.5% to 26.0 tokens/s, while prefill speed remained largely unchanged. The MTP run also consumed less power than the non-MTP baseline.
AI IMPACT Demonstrates improved inference speed for large local models, potentially enabling more responsive AI applications on consumer hardware.
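The summary gives the MTP generation speed (26.0 tokens/s) and the relative gain (27.5%) but not the baseline figure. As a rough sketch, assuming the 27.5% is measured relative to non-MTP generation speed on the same setup, the implied baseline can be back-calculated. The function names below are illustrative and do not come from the benchmark, and the 1,000-token reply length is a hypothetical example value.

```python
def implied_baseline(boosted_tps: float, gain_percent: float) -> float:
    """Back out the baseline rate from a boosted rate and its relative gain."""
    return boosted_tps / (1.0 + gain_percent / 100.0)


def seconds_for_tokens(n_tokens: int, tokens_per_second: float) -> float:
    """Time needed to generate n_tokens at a steady generation rate."""
    return n_tokens / tokens_per_second


# Figures reported in the summary.
mtp_tps = 26.0
gain_percent = 27.5

# Assumption: the gain is relative to the non-MTP generation speed.
baseline_tps = implied_baseline(mtp_tps, gain_percent)

# Hypothetical reply length, chosen only for illustration.
reply_tokens = 1000
baseline_seconds = seconds_for_tokens(reply_tokens, baseline_tps)
mtp_seconds = seconds_for_tokens(reply_tokens, mtp_tps)

print(f"implied baseline: {baseline_tps:.1f} tokens/s")
print(f"{reply_tokens} tokens without MTP: {baseline_seconds:.1f} s")
print(f"{reply_tokens} tokens with MTP: {mtp_seconds:.1f} s")
```

Under that assumption the baseline works out to roughly 20.4 tokens/s, so a 1,000-token reply would take about 49 s without MTP and about 38 s with it. This ignores prefill time, which the summary reports as largely unchanged.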
RANK_REASON User benchmark of a specific model version and its performance characteristics.
AI-generated summary · Google Gemini · from 1 source.