PulseAugur

User seeks advice on optimizing dual-GPU inference with llama.cpp

A user on the r/LocalLLaMA subreddit is asking how to get better performance from an asymmetric dual-GPU setup: a 3080 Ti with 12GB of VRAM and a 3080 with 20GB, the latter slightly slower but with more memory. Inference speed drops sharply whenever the model and its cache do not fit entirely in VRAM. The user is experimenting with llama.cpp and with various quantization and caching strategies to maximize inference speed.
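
As a rough illustration of the arithmetic behind such a setup, the sketch below computes a VRAM-proportional split for the two cards and checks whether a model plus its cache would fit. It is not taken from the post: the 1 GiB per-card reserve and the model and cache sizes are assumptions chosen for illustration, and the helper names are hypothetical. The only llama.cpp detail relied on is that `--tensor-split` accepts a comma-separated list of proportions.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    vram_gib: float


def usable_vram(gpus, reserve_gib=1.0):
    """VRAM left per card after an assumed reserve for buffers and the desktop."""
    return [max(g.vram_gib - reserve_gib, 0.0) for g in gpus]


def tensor_split(gpus, reserve_gib=1.0):
    """Fractions of the model to place on each GPU, proportional to usable VRAM."""
    usable = usable_vram(gpus, reserve_gib)
    total = sum(usable)
    if total == 0:
        raise ValueError("no usable VRAM")
    return [u / total for u in usable]


def fits_in_vram(model_gib, kv_cache_gib, gpus, reserve_gib=1.0):
    """True if the weights plus KV cache fit in the combined usable VRAM."""
    return model_gib + kv_cache_gib <= sum(usable_vram(gpus, reserve_gib))


# The two cards from the post; the sizes passed to fits_in_vram are hypothetical.
gpus = [Gpu("3080 Ti", 12.0), Gpu("3080", 20.0)]
split = tensor_split(gpus)
ts_arg = ",".join(f"{s:.2f}" for s in split)
print(f"--tensor-split {ts_arg}")
print("18 GiB model + 4 GiB cache fits:", fits_in_vram(18.0, 4.0, gpus))
print("28 GiB model + 4 GiB cache fits:", fits_in_vram(28.0, 4.0, gpus))
```

A proportional split is only a starting point. Because the larger card is also the slower one, the best ratio in practice may differ and would need to be measured.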

IMPACT User seeks to optimize local LLM inference performance, impacting individual operator efficiency.

RANK_REASON User-generated advice request on a technical forum.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/pentothal ·

    advice for dual-gpu asymmetric

    Hello everyone, I had a 3080 Ti 12GB and added a 3080 20GB, so it has a bit less speed but more memory than my main card. I could finally get some speed with the usual suspects (I am testing gemma 4 31b/26b-a4b and qwen 3.6 27b/35b-a3b), BUT…