PulseAugur

llama-server router allocates CUDA context on all GPUs, causing OOM errors

A user on r/LocalLLaMA reports that in llama-server's router mode, each model instance creates a CUDA context on every available GPU, even when the model is pinned to a specific GPU. When a large model has already filled most of the VRAM on some cards, smaller models pinned to other GPUs cannot initialize their contexts on the full cards and fail with out-of-memory errors. The user is asking whether a flag or configuration option can prevent context allocation on unused GPUs, or whether there is a better way to manage GPU resources for a mix of several small models and an occasional single large model.
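The source does not report a fix. One general CUDA mechanism that may be relevant is the `CUDA_VISIBLE_DEVICES` environment variable, which hides GPUs from a process so that no context can be created on them. The sketch below assumes each model is launched as its own llama-server process outside router mode; whether router mode can pass a per-model environment is not stated in the post. The helper name `build_launch`, the model path, and the port are hypothetical.

```python
import os


def build_launch(model_path, port, gpu_ids, base_env=None):
    """Return (argv, env) for one llama-server process limited to gpu_ids.

    Hypothetical helper: it only builds the command and environment,
    it does not start anything.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Order devices by PCI bus ID so indices match what nvidia-smi shows.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # GPUs not listed here are invisible to the process, so the CUDA
    # runtime cannot allocate a context on them.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    argv = ["llama-server", "-m", model_path, "--port", str(port)]
    return argv, env


if __name__ == "__main__":
    # Assumed layout: a small model confined to GPU 2 (path is a placeholder).
    argv, env = build_launch("/models/small.gguf", 8081, [2])
    print(argv, env["CUDA_VISIBLE_DEVICES"])
    # To actually start the server: subprocess.Popen(argv, env=env)
```

Inside the restricted process the visible GPUs are renumbered from 0, so any device-selection flags passed to that process would need to use the renumbered indices.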



AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA · TIER_1 · English (EN) · /u/HockeyDadNinja

    llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is?

    Running into something annoying with llama-server in router mode (`--models-preset`) and I can't tell if I'm missing a flag or if this is just how it works.

    My rig is 2x 3090, 2x 4060 Ti (one's unplugged at the moment, riser got repurpose…