PulseAugur
EN
LIVE 07:27:49

LocalLLaMA user seeks llama-swap concurrent request fix

A user on the r/LocalLLaMA subreddit is seeking assistance with configuring llama-swap to handle concurrent requests for a single model. They have successfully set up Qwen 3.6 35B A3B with multi-GPU support and concurrency enabled via llama-server, but llama-swap appears to serialize requests instead of processing them in parallel. The user has explored various configuration options and issue trackers without success, specifically aiming to avoid running multiple llama-cpp instances to conserve GPU memory. AI

RANK_REASON User-generated question about a specific software configuration issue, not a general release or significant industry event.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 sources. How we write summaries →

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/sickmartian ·

    anybody got llama-swap working answering concurrent requests for a single model?

    <!-- SC_OFF --><div class="md"><p>been trying this out for a bit, I have qwen 3.6 35b a3b running via this config:</p> <pre><code>qwen-36-35b-a3b: aliases: - qwen-a3b cmd: | env __GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 DRI_PRIME=1 \ llama-server \ -m &quot;${b…