PulseAugur

Qwen3.6-35B model runs 128K context on RTX 3060

A Reddit user has detailed how to run the Qwen3.6-35B-A3B-APEX model with a 128K context window on an RTX 3060 12GB graphics card. The setup uses a fork of llama.cpp with CUDA optimizations from spiritbuun and APEX quantization from mudler. It generates 37 tokens per second with 72k tokens of context filled, and the model achieved 100% retrieval in needle-in-a-haystack tests.

IMPACT Demonstrates efficient local execution of large context models on consumer GPUs, lowering barriers for experimentation.

RANK_REASON User-driven optimization and benchmark of an open-source model on consumer hardware.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/old-mike

    Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model

    I'm posting this because it may be helpful to squeeze the 12GB VRAM in the 3060.

    All credit goes to spiritbuun's fork (https://github.com/spiritbuun/buun-llama-cpp) and …
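
The post title gives three figures that can be sanity-checked with simple arithmetic: a 17GB model, a 12GB card, and 37 t/s generation. The sketch below is only a back-of-the-envelope estimate, not the poster's method. The helper names and the `reserved_gb` parameter (VRAM set aside for the KV cache and buffers) are hypothetical, and the post's "GB" figures are taken at face value.

```python
def min_offload_gb(model_gb: float, vram_gb: float, reserved_gb: float = 0.0) -> float:
    """Lower bound on model data that cannot fit in VRAM.

    reserved_gb is a hypothetical allowance for the KV cache and
    scratch buffers; the post excerpt does not state this value.
    """
    usable_gb = vram_gb - reserved_gb
    return max(0.0, model_gb - usable_gb)


def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a constant decode speed."""
    return n_tokens / tokens_per_second


# Figures taken from the post title.
MODEL_GB = 17.0
VRAM_GB = 12.0
GEN_TPS = 37.0

# At least this much of the model must live outside the GPU.
print(min_offload_gb(MODEL_GB, VRAM_GB))            # 5.0
# Approximate wall-clock time for a 1,000-token reply.
print(round(generation_seconds(1000, GEN_TPS), 1))  # 27.0
```

Even with no VRAM reserved for context, at least 5GB of the model has to sit in system memory, which is why the reported 37 t/s with 72k tokens filled is notable for this card.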