Brief · PulseAugur

COMMENTARY · r/LocalLLaMA English(EN) · 5h

advice for dual-gpu asymmetric

A user on r/LocalLLaMA is seeking advice on optimizing inference performance with an asymmetric dual-GPU setup: a 3080 Ti with 12GB of VRAM and a 3080 with 20GB of VRAM. They report significant speed drops whenever the model and its cache do not fit entirely in VRAM, and are experimenting with llama.cpp and various quantization and caching strategies to maximize inference speed.
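As a rough illustration of the fit problem described above, the sketch below estimates a proportional split across the two cards and checks whether a model plus its cache fits in combined VRAM. It is a minimal sketch under stated assumptions, not the poster's method: the function names, the 1GB per-card reserve, and the model and cache sizes are hypothetical. The resulting proportions are the kind of value llama.cpp's `--tensor-split` option accepts, assuming the list follows the device order llama.cpp sees.

```python
def tensor_split(vram_gb, reserve_gb=1.0):
    """Return per-GPU proportions based on usable VRAM.

    reserve_gb is an assumed headroom per card for driver/runtime overhead;
    the real figure depends on the system and is not given in the post.
    """
    usable = [max(v - reserve_gb, 0.0) for v in vram_gb]
    total = sum(usable)
    if total == 0:
        raise ValueError("no usable VRAM")
    return [u / total for u in usable]


def fits_in_vram(model_gb, kv_cache_gb, vram_gb, reserve_gb=1.0):
    """Check whether model weights plus cache fit in combined usable VRAM."""
    usable = sum(max(v - reserve_gb, 0.0) for v in vram_gb)
    return model_gb + kv_cache_gb <= usable


# VRAM sizes as stated in the post: 12GB and 20GB.
vram = [12.0, 20.0]
split = tensor_split(vram)
print(",".join(f"{s:.2f}" for s in split))  # prints "0.37,0.63"

# Hypothetical sizes for illustration only.
print(fits_in_vram(model_gb=18.0, kv_cache_gb=4.0, vram_gb=vram))  # prints True
print(fits_in_vram(model_gb=28.0, kv_cache_gb=4.0, vram_gb=vram))  # prints False
```

A proportional split is only a starting point. When the combined total does not fit, a smaller quantization or a shorter context is usually what brings the workload back inside VRAM.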

IMPACT Narrow: relevant to individual operators tuning local LLM inference on multi-GPU systems with mismatched VRAM.

Qwen
llama.cpp
CUDA
Gemma
r/LocalLLaMA
Arch Linux
NCCL
NVIDIA GeForce RTX 3080 Ti