Brief · PulseAugur

MEME · r/LocalLLaMA English(EN) · 3h

llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is?

A user on r/LocalLLaMA reports that in llama-server's router mode, each model instance creates a CUDA context on every available GPU, even when the model is pinned to a single card. When a large model has already consumed most of the VRAM on some cards, smaller models fail with out-of-memory errors because they cannot initialize a context on those full cards. The user asks whether a flag or configuration option can prevent context allocation on unused GPUs, or whether another strategy would better handle both running several small models side by side and occasionally deploying a single large model.
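One general CUDA-level approach, not confirmed by the post as a fix for router mode, is to run each model as its own llama-server process and restrict which GPUs that process can see through the standard `CUDA_VISIBLE_DEVICES` environment variable. A process that sees only one card cannot create a context on the others. The sketch below only builds the command and environment for such a launch; the helper name `build_launch`, the model paths, and the ports are hypothetical, and the `-m`, `--port`, and `-ngl` flags are assumed to match the installed llama-server build.

```python
import os
from typing import Dict, List, Optional, Tuple


def build_launch(
    model_path: str,
    gpu_index: int,
    port: int,
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, environment) for one llama-server process on one GPU.

    Hypothetical helper: it does not start anything, it only prepares the
    arguments that could be passed to subprocess.Popen(cmd, env=env).
    """
    env = dict(os.environ if base_env is None else base_env)
    # The CUDA runtime only enumerates the devices listed here, so the
    # process cannot allocate a context on any other card. Inside the
    # process, the selected card appears as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    cmd = [
        "llama-server",
        "-m", model_path,      # assumed path to a GGUF model file
        "--port", str(port),   # one port per process
        "-ngl", "99",          # offload all layers to the visible GPU
    ]
    return cmd, env


if __name__ == "__main__":
    # Illustrative plan: two small models, each confined to its own card.
    plan = [("models/small-a.gguf", 0, 8081), ("models/small-b.gguf", 1, 8082)]
    for path, gpu, port in plan:
        cmd, env = build_launch(path, gpu, port)
        print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(cmd))
```

The trade-off is that each model then runs as a separate server on its own port, so request routing would have to happen in front of the processes (for example in a reverse proxy) rather than inside llama-server's router mode.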

CUDA
llama-server
Gemma
GPU
Nomic