Brief · PulseAugur

TOOL · r/LocalLLaMA · English (EN) · 6h

PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)

A user on Reddit's r/LocalLLaMA subreddit reports a significant performance improvement in the llama.cpp inference engine from adjusting the `--threads` argument. The initial assumption was that limiting threads to the number of performance cores is optimal on hybrid CPUs. However, testing with the Gemma 4 26B A4B QAT model showed that raising the thread count to 16 on an 18-core CPU (6 performance, 12 efficiency) produced an approximately 80% performance uplift. The finding suggests that users should experiment with thread counts beyond the number of performance cores to maximize inference speed, especially on CPU-only or hybrid CPU/GPU setups.

IMPACT Optimizing thread counts can unlock significant performance gains for local LLM inference, potentially making larger models more accessible on consumer hardware.
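The experiment described above can be approximated with a small thread-count sweep. The following is a minimal sketch, not the original poster's method: the helper names (`candidate_thread_counts`, `bench_command`, `pick_best`), the model path, and the step size are all hypothetical, and it assumes llama.cpp's `llama-bench` tool is on the PATH and accepts `-m` (model) and `-t` (threads). Measured rates have to be collected from your own runs; none are supplied here.

```python
def candidate_thread_counts(p_cores, total_cores, step=2):
    """Thread counts to try: start at the P-core count, end at the total core count."""
    if not 1 <= p_cores <= total_cores:
        raise ValueError("expected 1 <= p_cores <= total_cores")
    counts = list(range(p_cores, total_cores + 1, step))
    if counts[-1] != total_cores:
        counts.append(total_cores)  # always include the full core count
    return counts


def bench_command(model_path, threads, binary="llama-bench"):
    """Build one benchmark invocation.

    Assumption: `binary` is llama.cpp's llama-bench with -m and -t flags.
    """
    return [binary, "-m", model_path, "-t", str(threads)]


def pick_best(tokens_per_second, baseline_threads):
    """Return (best_threads, uplift_percent) relative to the baseline thread count.

    `tokens_per_second` maps thread count -> measured generation rate.
    """
    best = max(tokens_per_second, key=tokens_per_second.get)
    base_rate = tokens_per_second[baseline_threads]
    uplift = (tokens_per_second[best] / base_rate - 1.0) * 100.0
    return best, uplift


if __name__ == "__main__":
    # 6 P-cores and 18 total cores match the CPU in the brief; the path is a placeholder.
    for n in candidate_thread_counts(p_cores=6, total_cores=18):
        print(" ".join(bench_command("model.gguf", n)))
```

Run each printed command, record the reported tokens per second for each thread count, and pass the resulting mapping to `pick_best` with the P-core count as the baseline. Sweeping the whole range rather than testing a single value matters because the best setting in the brief (16) sits between the P-core count and the total core count.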

GPT-OSS 120B
llama.cpp
Unsloth
Threads
RTX 4070 SUPER 12GB
14700K
250K Plus
Gemma 4 26B A4B QAT