PulseAugur

Qwen3-35B MoE model hits 267 tok/s on RTX 5090 with llama.cpp

A user reported 267 tokens per second for local LLM inference running Qwen3-35B-A3B with llama.cpp's Multi-Token Prediction (MTP) on an RTX 5090. With electricity as the only running cost, the setup reportedly outperformed cloud-hosted models such as Claude Haiku on both speed and cost. The user observed that speculative decoding nearly doubled throughput on MoE models and hypothesizes that the sparse activation patterns of a Mixture-of-Experts (MoE) architecture leave compute headroom for MTP.
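To make the throughput figures concrete, the sketch below converts tokens per second into per-token latency and evaluates the standard expected-tokens-per-pass formula for speculative decoding. It is an illustration only: the acceptance rates and draft lengths are hypothetical values, not measurements from the post, the function names are invented for this example, and the formula assumes each drafted token is accepted independently while ignoring the cost of producing the draft.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert a throughput in tokens/second to latency in milliseconds/token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1000.0 / tokens_per_second


def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumes (hypothetically) that each drafted token is accepted
    independently with probability `accept_rate`, and that the pass always
    yields at least one token. Draft cost is ignored, so this is an upper
    bound on the speedup, not a prediction.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every drafted token is accepted, plus one token from the main model.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # 267 tok/s is the figure reported in the post.
    print(f"{ms_per_token(267):.2f} ms per token at 267 tok/s")
    # Hypothetical acceptance rates and draft lengths, for illustration only.
    for accept_rate in (0.5, 0.7, 0.9):
        for draft_len in (1, 2, 4):
            gain = expected_tokens_per_pass(accept_rate, draft_len)
            print(f"accept={accept_rate:.1f} draft_len={draft_len} -> {gain:.2f} tokens/pass")
```

Under these assumptions, a near doubling of throughput would correspond to roughly two tokens emitted per verification pass, provided drafting and verifying the extra tokens add little time, which is the headroom the user's hypothesis attributes to sparse MoE activation.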

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Demonstrates 267 tok/s local inference for an MoE model on a single RTX 5090, which could reduce reliance on cloud APIs and lower operating costs.

RANK_REASON User-driven benchmark and performance analysis of an open-source model and inference engine.

COVERAGE [1]

  1. dev.to — LLM tag TIER_1 · gen ·

    267 tok/s local inference on RTX 5090 – llama.cpp MTP + Qwen3-35B-A3B MoE

    Been running Qwen3-35B-A3B (MoE) with llama.cpp's Multi-Token Prediction (MTP / speculative decoding) on an RTX 5090 under WSL2. Results surprised me:

    Model | Speed
    …