A user shared optimization tips for running the DeepSeek v4 Flash model locally, reporting nearly 200 tokens per second on a Hopper system. Using specific quants from Canada-Quant and a patch to the multi-token prediction (MTP) code in vLLM, the user significantly improved inference speed. The post also covers costs, noting that electricity costs for token generation currently exceed revenue.
IMPACT Provides practical insights for optimizing local LLM inference speeds, potentially reducing operational costs for users.
RANK_REASON User-shared optimization tips for a specific model and hardware setup.
AI-generated summary · Google Gemini · from 1 source.