A user reported 267 tokens per second for local LLM inference, running Qwen3-35B-A3B with llama.cpp's Multi-Token Prediction (MTP) on an RTX 5090. The setup, whose only running cost is electricity, beat cloud-hosted models such as Claude Haiku on both speed and cost. Speculative decoding nearly doubled throughput with Mixture-of-Experts (MoE) models, and the user hypothesizes that this synergy arises because MoE's sparse activation leaves compute headroom that MTP can use.
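To make the throughput claim concrete, here is a minimal sketch of the standard simplified model of speculative decoding. It is not llama.cpp's MTP implementation and uses no figures from the benchmark. The function names and the parameters `acceptance_rate`, `draft_len`, and `draft_cost_ratio` are hypothetical, and the model assumes each drafted token is accepted independently with the same probability.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumption: each drafted token is accepted independently with
    probability `acceptance_rate`, and every pass yields at least one token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft is accepted, plus the one token the pass always yields.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def idealized_speedup(acceptance_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Idealized speedup over plain one-token-per-pass decoding.

    `draft_cost_ratio` is the assumed cost of drafting one token,
    expressed as a fraction of one main-model pass.
    """
    tokens = expected_tokens_per_step(acceptance_rate, draft_len)
    step_cost = 1.0 + draft_len * draft_cost_ratio
    return tokens / step_cost


if __name__ == "__main__":
    # Illustrative inputs only; these are not measurements from the report.
    for rate in (0.5, 0.7, 0.9):
        print(rate, round(idealized_speedup(rate, draft_len=2, draft_cost_ratio=0.1), 2))
```

Under this model, a speedup near 2x requires both a high acceptance rate and cheap drafting. That is consistent with the user's headroom hypothesis, though the sketch does not test it.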
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Demonstrates that MoE models can reach high local inference speeds (267 tokens per second in this report), potentially reducing reliance on cloud APIs and lowering operational costs.
RANK_REASON User-driven benchmark and performance analysis of an open-source model and inference engine.