A user has shared configurations and benchmarks for running the DeepSeek V4-Flash model on a dual DGX Spark setup. The setup achieves approximately 40 tokens per second at FP8 precision, and up to 350 tokens per second in aggregate when handling multiple concurrent requests with a 256k-token context window. The results are compared against Nvidia RTX Pro 6000 and Mac M2 Ultra systems, with the dual DGX Spark setup presented as efficient for large-model inference.
IMPACT Demonstrates high-throughput inference for large models on accessible hardware, potentially lowering barriers for advanced AI applications.
RANK_REASON User-generated benchmark and configuration for running a specific LLM on consumer/prosumer hardware.
- DeepSeek V4-Flash
- DGX Sparks
- Mac M2 Ultra
- MOD St Athan
- Nvidia RTX Pro 6000 Blackwell Workstation Edition
AI-generated summary · Google Gemini · from 1 source.