PulseAugur

User achieves near-linear scaling with dual GPUs for Qwen LLM

A user on Reddit's r/LocalLLaMA forum reported near-linear scaling in decoding throughput after adding a second GPU. Running the qwen3.6-27b-autoround-int4 model, the move from one GPU to two raised decoding throughput substantially on both narrative and code tasks. The gain was observed without NVLink, using tensor parallelism and peer-to-peer (P2P) communication between the cards.
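"Near-linear scaling" can be made concrete as a speedup and a per-GPU efficiency. The sketch below is one way to compute both from tokens-per-second readings. The function name `scaling_efficiency` and the throughput figures are illustrative placeholders, not measurements from the Reddit post.

```python
def scaling_efficiency(single_gpu_tps: float, multi_gpu_tps: float, num_gpus: int) -> tuple[float, float]:
    """Return (speedup, efficiency) of a multi-GPU run relative to a single GPU.

    speedup    = multi-GPU throughput / single-GPU throughput
    efficiency = speedup / number of GPUs (1.0 means perfectly linear scaling)
    """
    if single_gpu_tps <= 0 or multi_gpu_tps <= 0 or num_gpus < 1:
        raise ValueError("throughputs must be positive and num_gpus >= 1")
    speedup = multi_gpu_tps / single_gpu_tps
    efficiency = speedup / num_gpus
    return speedup, efficiency


# Placeholder decode throughputs in tokens/s (hypothetical, NOT the poster's numbers).
speedup, efficiency = scaling_efficiency(single_gpu_tps=50.0, multi_gpu_tps=95.0, num_gpus=2)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# prints "speedup: 1.90x, efficiency: 95%"
```

An efficiency close to 100% is what "near-linear" means here. Tensor-parallel inference usually lands below that because the GPUs must exchange activations at every layer.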

IMPACT Suggests that a second GPU can nearly double decoding throughput for local LLM inference, even without NVLink.

RANK_REASON User-generated report on model performance scaling with hardware.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English (EN) · /u/Civil_Fee_7862

    Weird to get near linear scaling by adding another GPU?

    Single-stream benchmarks (club-3090)

    model: qwen3.6-27b-autoround-int4

    BEFORE:

    1x3090

    Their default script recipe for single 3090s (4-bit quant and 4-bit KV cache, mtp=2)…