Qwen3.6-27B reaches 100+ t/s with tensor split

By the PulseAugur editorial team · [1 source] · 2026-06-23 03:29

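A user on Reddit's r/LocalLLaMA shared their setup for running Qwen3.6-27B at Q8 quantization. By switching to tensor split mode across an RTX 5090 and an RTX 3090 Ti, they reached more than 100 tokens per second, up from about 70 t/s with their previous layer-split setup. The configuration uses a 70/30 tensor split weighted toward the more powerful 5090, and the GPUs alone draw more than 750 W.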
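To put the 70/30 split in context, here is a rough back-of-the-envelope sketch of how the model weights might be divided between the two cards. It is an estimate under stated assumptions, not a description of the poster's actual memory usage: it assumes exactly 27e9 parameters, assumes every weight is stored as Q8_0 (8.5 bits per weight, from 32 int8 values plus one fp16 scale per block), assumes the split ratio applies proportionally to all weights, and ignores the KV cache and runtime buffers. The function names are illustrative only.

```python
# Rough estimate only: weights-only footprint, ignoring KV cache and runtime buffers.

# Q8_0 stores 32 int8 weights plus one fp16 scale per block: 34 bytes / 32 weights.
Q8_0_BITS_PER_WEIGHT = 8.5


def weights_gb(n_params: float, bits_per_weight: float = Q8_0_BITS_PER_WEIGHT) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def split_gb(total_gb: float, ratios: tuple[float, ...]) -> list[float]:
    """Divide total_gb in proportion to ratios, e.g. (70, 30)."""
    total_ratio = sum(ratios)
    return [total_gb * r / total_ratio for r in ratios]


if __name__ == "__main__":
    total = weights_gb(27e9)  # assumption: exactly 27e9 parameters
    share_5090, share_3090ti = split_gb(total, (70, 30))
    print(f"weights:             {total:.1f} GB")
    print(f"RTX 5090 (32 GB):    {share_5090:.1f} GB")
    print(f"RTX 3090 Ti (24 GB): {share_3090ti:.1f} GB")
```

Under these assumptions the weights come to roughly 28.7 GB, with about 20.1 GB on the 5090 and 8.6 GB on the 3090 Ti. That would leave headroom on both cards for context, though the real figures depend on the runtime and context length.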

Impact: Demonstrates an efficient multi-GPU inference configuration for local LLM deployment.

Ranking rationale: A user-shared configuration for running a specific LLM on consumer hardware.


Infrastructure

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA TIER_1 English(EN) · /u/Shoddy_Bed3240 · 2026-06-23 03:29

100+ t/s on Qwen3.6-27B Q8 across a 5090 + 3090 Ti — switching to tensor split-mode got me from 70 to 100+

Wanted to share a setup that's been working great for me. Running Qwen3.6-27B at Q8_0 across two GPUs (RTX 5090 + RTX 3090 Ti) and getting ~100 t/s.

The big jump came from switching `--split-mode` to `tensor`. I was s…
