PulseAugur

User optimizes DeepSeek v4 Flash to nearly 200 tok/s on Hopper

A user shared tips for running the DeepSeek v4 Flash model locally at nearly 200 tokens per second on a Hopper system. Using quants from Canada-Quant and a patch to the MTP code in vLLM, the user significantly improved inference speed. The post also covers the cost side, noting that electricity costs for token generation currently exceed revenue.

IMPACT Provides practical insights for optimizing local LLM inference speeds, potentially reducing operational costs for users.

RANK_REASON User-shared optimization tips for a specific model and hardware setup.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/Reddactor

    Here are some tips on hitting nearly 200 tok/s for DeepSeek v4 Flash on Hopper

    I needed a smarter model for my local Hermes Agent setup, so I moved to DeepSeek v4 Flash.

    First things first:

    - Running 4 concurrent threads on vLLM, I can hit ~400 tok/s
    - 400 x 60 x 60 x 24 x 30 is ~1B TOKENS p…
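
The throughput arithmetic in the excerpt can be checked with a short sketch. Only the 400 tok/s rate and the 30-day window come from the post. The function names and the `electricity_cost_per_million` helper are illustrative assumptions, and the helper takes power draw and electricity price as inputs because the excerpt does not state them.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def tokens_per_period(tok_per_s: float, days: int = 30) -> int:
    """Total tokens generated at a sustained rate over `days` days."""
    return int(tok_per_s * SECONDS_PER_DAY * days)


def electricity_cost_per_million(tok_per_s: float, watts: float,
                                 price_per_kwh: float) -> float:
    """Electricity cost of generating one million tokens.

    `watts` and `price_per_kwh` are caller-supplied assumptions;
    the source excerpt gives no figures for either.
    """
    seconds = 1_000_000 / tok_per_s
    kwh = watts * seconds / 3_600_000  # watt-seconds (joules) to kWh
    return kwh * price_per_kwh


# 400 tok/s sustained for 30 days, as in the post
monthly = tokens_per_period(400)
print(f"{monthly:,} tokens per 30 days")  # 1,036,800,000
```

At 400 tok/s the 30-day total is 1,036,800,000 tokens, which matches the "~1B" figure in the post. Comparing the per-million electricity cost against a per-million token price is one way to test the summary's claim that electricity currently costs more than the tokens earn, once real power and price figures are plugged in.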