A Reddit user has detailed how to run the Qwen3.6-35B-A3B-APEX model with a 128K context window on an RTX 3060 12GB graphics card. The setup uses a fork of llama.cpp with CUDA optimizations from spiritbuun, together with APEX quantization from mudler. It reportedly generates 37 tokens per second with 72,000 tokens already in the context, and the model achieved 100% retrieval in needle-in-a-haystack tests.
IMPACT Demonstrates that long-context models can run efficiently on consumer GPUs, lowering the barrier to local experimentation.
RANK_REASON User-driven optimization and benchmark of an open-source model on consumer hardware.
AI-generated summary · Google Gemini · from 1 source.