PulseAugur

DeepSeek V4-Flash achieves 40 tk/s on dual DGX Sparks

A user has shared configurations and benchmarks for running the DeepSeek V4-Flash model on a pair of DGX Spark systems. The setup achieves approximately 40 tokens per second for a single request with FP8 precision, and up to 350 tokens per second in aggregate when handling multiple concurrent requests with a 256k context window. The post compares these results against Nvidia RTX Pro 6000 and Mac M2 Ultra systems, highlighting the dual DGX Spark setup's efficiency for large model inference.
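To make the relationship between the two headline figures concrete, here is a minimal Python sketch that computes how much total throughput batching adds over a single stream. The 40 tk/s and 350 tk/s values come from the post title. The function names and the concurrency of 16 are hypothetical, chosen only for illustration, because the summary does not say how many requests ran in parallel.

```python
def batching_gain(single_tps: float, aggregate_tps: float) -> float:
    """Ratio of aggregate throughput to single-request throughput."""
    if single_tps <= 0:
        raise ValueError("single_tps must be positive")
    return aggregate_tps / single_tps


def per_request_tps(aggregate_tps: float, concurrent_requests: int) -> float:
    """Average tokens per second each request sees when sharing the aggregate."""
    if concurrent_requests <= 0:
        raise ValueError("concurrent_requests must be positive")
    return aggregate_tps / concurrent_requests


# Figures reported in the post title.
SINGLE_TPS = 40.0
AGGREGATE_TPS = 350.0

# ASSUMPTION: the concurrency level is not given in the summary; 16 is a placeholder.
ASSUMED_CONCURRENCY = 16

gain = batching_gain(SINGLE_TPS, AGGREGATE_TPS)
share = per_request_tps(AGGREGATE_TPS, ASSUMED_CONCURRENCY)
print(f"aggregate / single = {gain:.2f}x")
print(f"average per request at {ASSUMED_CONCURRENCY} concurrent = {share:.3f} tk/s")
```

The point of the sketch is that aggregate throughput is shared: each individual request under load typically runs slower than the single-request figure, even though the system as a whole produces more tokens per second.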

IMPACT Demonstrates high-throughput inference for large models on accessible hardware, potentially lowering barriers for advanced AI applications.

RANK_REASON User-generated benchmark and configuration for running a specific LLM on consumer/prosumer hardware.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/elsung

    Dual DGX Sparks- 40tk/s single 1M ; 350 tk/s agg. - Deepseek V4 Flash (vs RTX Pro 6000 vs Mac M2 Ultra 192)

    First of all, shout-out to Aiden/Antirez & the geniuses at the Nvidia community threads. I'm merely claude-vibing off of their work.

    That said, I thought I'd share recipes, learnings, and benchmarks so far on running big MoE models…