PulseAugur

New --numa mirror mode boosts CPU inference performance

A developer has forked the ik_llama.cpp project to add a "--numa mirror" mode aimed at multi-socket CPU systems. On such systems, a CPU pays a significant performance penalty whenever it accesses memory attached to another socket. The new mode avoids that penalty by keeping a separate copy of the model weights and KV cache for each CPU socket. This requires more RAM, but it lets inference use every CPU core across all sockets, unlike the "--numa isolate" mode, which limits usage to a single socket. The developer is seeking testers to evaluate the performance gains on various hardware setups.
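
The RAM trade-off described above can be estimated with simple arithmetic. The sketch below is only an illustration, not code from the fork: the function name and the example sizes are hypothetical, and it assumes each socket holds one full copy of both the weights and the KV cache, as the summary states.

```python
def mirror_ram_gib(weights_gib: float, kv_cache_gib: float, sockets: int) -> float:
    """Rough RAM estimate when every socket keeps its own full copy.

    Assumption (taken from the summary, not from the fork's source code):
    weights and KV cache are both duplicated once per CPU socket.
    """
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    return sockets * (weights_gib + kv_cache_gib)


# Hypothetical sizes: a 40 GiB model with a 4 GiB KV cache.
single_socket = mirror_ram_gib(40.0, 4.0, sockets=1)
dual_socket = mirror_ram_gib(40.0, 4.0, sockets=2)
print(f"1 socket: {single_socket} GiB, 2 sockets: {dual_socket} GiB")
```

Under that assumption, memory use scales linearly with the socket count, so a dual-socket machine needs roughly twice the RAM of a single-copy setup.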

IMPACT This optimization could improve inference speeds for users with multi-socket CPU systems, potentially making local LLM deployment more efficient.

RANK_REASON This is a fork of an existing project with a new feature for performance optimization, not a novel release or significant industry event.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English (EN) · /u/_TheWolfOfWalmart_

    I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers!

    GitHub: https://github.com/mikechambers84/ik_llama.cpp/tree/numa-mirror

    Be sure to check out the `numa-mirror` branch.

    Sharing …