Qwen3.6-35B Model Runs a 128K Context on an RTX 3060

By the PulseAugur editorial team · [1 source] · 2026-05-28 11:12

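A Reddit user has detailed how to run the Qwen3.6-35B-A3B-APEX model with a 128K context window on an RTX 3060 12GB graphics card. The setup relies on a fork of llama-cpp, combined with spiritbuun's CUDA optimizations and mudler's APEX quantization. With 72,000 tokens of context filled, it generates 37 tokens per second, and the model achieved a 100% retrieval rate in a needle-in-a-haystack test.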
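A needle-in-a-haystack test hides one short fact inside a long stretch of filler text and checks whether the model can quote it back when asked. The harness used in the Reddit post is not shown in this summary, so the following is only a minimal sketch of the general technique. The filler sentence, the needle, the question, and `stub_model` are all hypothetical stand-ins, not anything taken from the post.

```python
from typing import Callable, List

# All four constants are hypothetical illustration values, not from the post.
FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret passphrase is amber-falcon-42."
QUESTION = "What is the secret passphrase?"
EXPECTED = "amber-falcon-42"


def build_haystack(n_sentences: int, depth: float) -> str:
    """Return filler text with NEEDLE inserted at a relative depth in [0, 1]."""
    sentences = [FILLER] * n_sentences
    position = int(depth * n_sentences)
    sentences.insert(position, NEEDLE + " ")
    return "".join(sentences)


def retrieval_rate(ask: Callable[[str], str], n_sentences: int,
                   depths: List[float]) -> float:
    """Fraction of depths at which the model's answer contains EXPECTED."""
    hits = 0
    for depth in depths:
        prompt = build_haystack(n_sentences, depth) + "\n\n" + QUESTION
        if EXPECTED in ask(prompt):
            hits += 1
    return hits / len(depths)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call: answers only if the needle is present."""
    return EXPECTED if NEEDLE in prompt else "unknown"


if __name__ == "__main__":
    rate = retrieval_rate(stub_model, n_sentences=2000,
                          depths=[0.0, 0.25, 0.5, 0.75, 1.0])
    print(f"retrieval rate: {rate:.0%}")
```

To test a real model, `stub_model` would be replaced by a function that sends the prompt to a local inference server and returns the generated text; the filler length would then be scaled until the prompt reaches the target token count.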

Impact: Shows that large-context models can run efficiently and locally on a consumer GPU, lowering the barrier to experimentation.

Ranking rationale: User-driven optimization and benchmarking of an open-source model on consumer hardware.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA · TIER_1 · English (EN) · /u/old-mike · 2026-05-28 11:12

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB - 37 t/s generation, 72k ctx filled, PPL 3.25, offloading a 17GB model

I'm posting this because it may be helpful to squeeze the 12GB VRAM in the 3060. All credit goes to spiritbuun's fork (https://github.com/spiritbuun/buun-llama-cpp) and …

