Brief · PulseAugur

TOOL · r/LocalLLaMA English(EN) · 20h

Dual DGX Sparks- 40tk/s single 1M ; 350 tk/s agg. - Deepseek V4 Flash (vs RTX Pro 6000 vs Mac M2 Ultra 192)

A user has shared configurations and benchmarks for running the DeepSeek V4-Flash model on a pair of DGX Spark units. The setup achieves approximately 40 tokens per second on a single request with FP8 precision, and up to 350 tokens per second in aggregate when serving multiple concurrent requests with a 256k context window. The post compares these figures against an Nvidia RTX Pro 6000 and a Mac M2 Ultra, presenting the dual DGX Spark setup as an efficient option for large-model inference.

IMPACT Demonstrates high-throughput inference for large models on accessible hardware, potentially lowering barriers for advanced AI applications.

DeepSeek V4-Flash
Nvidia RTX Pro 6000 Blackwell Workstation Edition
DGX Sparks
Mac M2 Ultra