PulseAugur / Brief

[1/1] 223 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is?

    A user on the r/LocalLLaMA subreddit reports that in llama-server's router mode, each model instance allocates a CUDA context on every available GPU, even when the model is pinned to a single GPU. When a large model already fills most of the VRAM on some cards, smaller models pinned to other cards fail with out-of-memory errors because they cannot create contexts on the full ones. The user asks whether a flag or configuration option prevents context allocation on unused GPUs, or whether another strategy can manage GPU resources for both several small models and an occasional single large model.
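
    One general strategy for this kind of problem, not confirmed by the post as something router mode supports, is to hide unused GPUs at the process level with the standard CUDA environment variable `CUDA_VISIBLE_DEVICES`. A process started with that variable can only enumerate the listed devices, so it cannot create contexts on the others. The sketch below assumes each model runs as its own process whose environment you control; the model names and the helpers `env_for_gpus` and `plan_launches` are hypothetical.

```python
import os


def env_for_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def plan_launches(assignments, base_env=None):
    """Map each (hypothetical) model name to the environment its process would get."""
    return {name: env_for_gpus(gpus, base_env) for name, gpus in assignments.items()}


if __name__ == "__main__":
    # Assumed layout: one large model on GPUs 0 and 1, two small models on 2 and 3.
    plan = plan_launches({"large": [0, 1], "small-a": [2], "small-b": [3]})
    for name, env in plan.items():
        print(name, "->", env["CUDA_VISIBLE_DEVICES"])
```

    Each environment could then be passed as the `env` argument of `subprocess.Popen` when starting a separate server process per model. Inside such a process the visible devices are renumbered from 0, so any per-model GPU index would need to refer to the remapped numbering.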