A user on Reddit's r/LocalLLaMA community reported a large speedup with a new llama.cpp build, b9455. Combined with tensor splitting across two RTX 3090 GPUs, the build reached over 70 tokens per second on the Qwen3.6-27B-UD-Q8_K_XL model. Earlier speeds were in the 30-50 tokens per second range, and the new figure matches performance previously seen only with vLLM.
IMPACT This update to llama.cpp may significantly boost inference speed for multi-GPU local LLM deployments, potentially enabling more complex models to run efficiently on consumer hardware.
RANK_REASON User-shared benchmark results for an open-source inference engine.
AI-generated summary · Google Gemini · from 1 source.