A user on the r/LocalLLaMA subreddit reports that in llama-server router mode, each model instance allocates a CUDA context on every available GPU, even when the instance is pinned to a specific GPU. This leads to out-of-memory errors when running multiple models: once a large model has consumed most of the VRAM on some cards, smaller models can no longer initialize their contexts. The user is asking for a flag or configuration option that prevents context allocation on unused GPUs, or for alternative strategies to manage GPU resources for two workloads: several small models running together, and an occasional single large model.
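One commonly used workaround for this class of problem, sketched here as an assumption rather than as the fix the poster adopted, is to start each model in its own llama-server process and restrict that process with the `CUDA_VISIBLE_DEVICES` environment variable, so the CUDA runtime cannot see the other cards or create contexts on them. This bypasses router mode entirely; the source does not say whether router mode can pass a per-model environment to the instances it spawns. The helper names `build_launch` and `launch` and the example path and port are hypothetical.

```python
import os
import subprocess


def build_launch(model_path, port, gpu_ids, base_env=None):
    """Return (argv, env) for one llama-server process limited to gpu_ids.

    Hypothetical helper: it only assembles the command and environment,
    so it can be inspected without a GPU or a llama-server binary.
    """
    env = dict(os.environ if base_env is None else base_env)
    # GPUs not listed here are hidden from the process, so no CUDA context
    # can be created on them. Inside the process the visible devices are
    # renumbered starting from 0.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    argv = ["llama-server", "-m", model_path, "--port", str(port)]
    return argv, env


def launch(model_path, port, gpu_ids):
    """Start the server as a child process (assumes llama-server is on PATH)."""
    argv, env = build_launch(model_path, port, gpu_ids)
    return subprocess.Popen(argv, env=env)
```

For example, `launch("models/small.gguf", 8081, [1])` would start a small model that can only see physical GPU 1, while a large model could be given `[0, 2, 3]` in a separate process. The trade-off is that the routing between models has to be handled by something else, such as a reverse proxy in front of the per-model ports.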
AI-generated summary · Google Gemini · from 1 source.