Brief · PulseAugur

TOOL · r/LocalLLaMA · 3h

Here are some tips on hitting nearly 200 tok/s for DeepSeek v4 Flash on Hopper

A user shared optimization tips for running the DeepSeek v4 Flash model locally, reporting nearly 200 tokens per second on a Hopper system. The speedup came from using specific quants from Canada-Quant and patching the MTP (multi-token prediction) code in vLLM. The post also covers cost, noting that the electricity cost of generating tokens currently exceeds the revenue from them.
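The cost comparison mentioned above comes down to simple arithmetic. The sketch below is illustrative only: the function name and every input value are assumptions made for the example, not figures from the post, and only the roughly 200 tok/s throughput echoes the summary.

```python
def electricity_cost_per_million_tokens(power_watts: float,
                                        tokens_per_second: float,
                                        price_per_kwh: float) -> float:
    """Electricity cost of generating one million tokens.

    The result is in whatever currency price_per_kwh uses.
    """
    # Time needed to produce one million tokens at the given throughput.
    seconds = 1_000_000 / tokens_per_second
    # Energy drawn over that time, converted from watt-seconds to kWh.
    kwh = (power_watts / 1000.0) * (seconds / 3600.0)
    return kwh * price_per_kwh


if __name__ == "__main__":
    # Hypothetical inputs: system power draw and electricity price are
    # placeholders, NOT numbers reported in the post.
    cost = electricity_cost_per_million_tokens(
        power_watts=1000.0,       # assumed wall power of the whole system
        tokens_per_second=200.0,  # roughly the throughput in the summary
        price_per_kwh=0.20,       # assumed electricity price
    )
    print(f"Electricity cost per 1M tokens: {cost:.4f}")
```

Comparing this figure with the price at which a million tokens could be sold shows whether generation runs at a loss on electricity alone, before hardware costs are counted.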

IMPACT Offers practical tips for speeding up local LLM inference, which could reduce operating costs for users.

Hopper
DeepSeek v4 Flash
vLLM
Canada-Quant