DeepSeek V4-Flash reaches 40 tok/s on dual DGX Sparks

By PulseAugur Editorial · [1 source] · 2026-06-14 09:07

A Reddit user has shared configurations and benchmarks for running the DeepSeek V4-Flash model on a pair of DGX Spark systems. Running in FP8 precision, the setup reaches roughly 40 tokens per second on a single request (the post title lists this at 1M context). It reaches up to about 350 tokens per second in aggregate when serving multiple concurrent requests with a 256k context window. The post compares these figures against an Nvidia RTX Pro 6000 and a Mac M2 Ultra (192 GB) as reference points for large-model inference on desktop-class hardware.
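To make the two throughput figures easier to compare, the short Python sketch below converts them into per-token latency and a scaling ratio. The 40 and 350 tokens-per-second values come from the post title. The even split of aggregate throughput across requests, the concurrency of 16, and the 1,000-token response length are hypothetical illustration values, not numbers reported by the poster.

```python
# Throughput figures from the post title, in tokens per second.
SINGLE_STREAM_TPS = 40.0   # one request at a time
AGGREGATE_TPS = 350.0      # summed across concurrent requests


def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time between generated tokens, in milliseconds."""
    return 1000.0 / tokens_per_second


def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds to decode num_tokens at a steady rate (ignores prompt prefill)."""
    return num_tokens / tokens_per_second


def per_request_tps(aggregate_tps: float, concurrent_requests: int) -> float:
    """Hypothetical per-request rate, assuming an even split of aggregate throughput."""
    return aggregate_tps / concurrent_requests


if __name__ == "__main__":
    print(f"single-stream latency: {per_token_latency_ms(SINGLE_STREAM_TPS):.1f} ms/token")
    print(f"aggregate vs single stream: {AGGREGATE_TPS / SINGLE_STREAM_TPS:.2f}x")
    # Hypothetical 1,000-token response at the single-stream rate.
    print(f"1,000 tokens, single stream: {generation_time_s(1000, SINGLE_STREAM_TPS):.1f} s")
    # Hypothetical concurrency of 16 requests.
    print(f"per-request rate at 16 requests: {per_request_tps(AGGREGATE_TPS, 16):.1f} tok/s")
```

At 40 tok/s, each token takes about 25 ms, and the aggregate figure is 8.75 times the single-stream rate. In practice, per-request speed under load depends on how the serving stack batches requests, so the even-split function is only a rough first approximation.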

IMPACT Demonstrates usable single-request speed and high aggregate throughput for a large MoE model on desktop-class hardware, potentially lowering the barrier to running such models locally.

RANK_REASON User-generated benchmark and configuration for running a specific LLM on consumer/prosumer hardware.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

r/LocalLLaMA TIER_1 English(EN) · /u/elsung · 2026-06-14 09:07

Dual DGX Sparks- 40tk/s single 1M ; 350 tk/s agg. - Deepseek V4 Flash (vs RTX Pro 6000 vs Mac M2 Ultra 192)

First of all, shout out to Aiden/Antirez & the geniuses in the Nvidia community threads. I'm merely claude-vibing off of their work.

That said, I thought I'd share recipes, learnings & benchmarks so far on running big MoE models…
