PulseAugur

llama.cpp performance boosted 80% by optimizing thread count

A user on Reddit's r/LocalLLaMA subreddit reports a large speedup in the llama.cpp inference engine from adjusting the `--threads` argument. The common advice for hybrid CPUs had been to limit threads to the number of performance cores. In the user's tests with the Gemma 4 26B A4B QAT model, however, raising the thread count to 16 on an 18-core CPU (6 performance, 12 efficiency) improved performance by roughly 80%. Because the result comes from one machine, users should benchmark thread counts beyond the performance-core count on their own hardware, especially for CPU or hybrid CPU/GPU setups.
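One way to run such an experiment is to sweep thread counts from the P-core count up to the full core count and keep the fastest. The following is a minimal sketch, not the Reddit author's method. It assumes that the `llama-bench` tool shipped with llama.cpp is on `PATH`, that it accepts `-m`, `-t`, `-p`, `-n` and `-o json`, and that each JSON row reports tokens per second in an `avg_ts` field; verify these against your build. The helper names (`candidate_thread_counts`, `sweep`, `best_thread_count`, `llama_bench_tps`) are hypothetical.

```python
import json
import subprocess
from typing import Callable, Dict, Iterable, List


def candidate_thread_counts(p_cores: int, total_cores: int, step: int = 2) -> List[int]:
    """Thread counts to try, from the P-core count up to every core."""
    if not 0 < p_cores <= total_cores or step < 1:
        raise ValueError("expected 0 < p_cores <= total_cores and step >= 1")
    counts = list(range(p_cores, total_cores + 1, step))
    if counts[-1] != total_cores:
        counts.append(total_cores)  # always include the all-cores case
    return counts


def sweep(counts: Iterable[int], measure: Callable[[int], float]) -> Dict[int, float]:
    """Map each thread count to the throughput returned by `measure`."""
    return {n: measure(n) for n in counts}


def best_thread_count(results: Dict[int, float]) -> int:
    """Thread count with the highest measured throughput."""
    return max(results, key=results.get)


def llama_bench_tps(model_path: str, threads: int) -> float:
    """Generation speed in tokens/s for one thread count.

    Assumption: `llama-bench` is on PATH and its JSON rows carry
    `n_gen` and `avg_ts` fields. Adjust if your build differs.
    """
    out = subprocess.run(
        ["llama-bench", "-m", model_path, "-t", str(threads),
         "-p", "0", "-n", "64", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    rows = json.loads(out)
    gen_rows = [r for r in rows if r.get("n_gen", 0) > 0]
    return float(gen_rows[0]["avg_ts"])
```

For the 6 P-core, 18-core CPU described above, `sweep(candidate_thread_counts(6, 18), lambda n: llama_bench_tps("model.gguf", n))` would test 6, 8, 10, 12, 14, 16 and 18 threads, where `model.gguf` is a placeholder path. Keeping the measurement behind a callable lets you swap in your own timing method if `llama-bench` is unavailable.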

IMPACT Optimizing thread counts can unlock significant performance gains for local LLM inference, potentially making larger models more accessible on consumer hardware.

RANK_REASON User-discovered optimization for an open-source inference engine.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/AXYZE8

    PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)

    When GPT-OSS 120B was released last year I played around and tried to maximize its performance. One thing that many people pointed out was that for hybrid CPU (Performance + Efficiency cores) you should use only P-cores with "--threads"…