Brief · PulseAugur

TOOL · r/LocalLLaMA · English (EN) · 4h

Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result

A user describes running the Qwen3.6-35B-A3B model on a laptop with an 8GB RTX 4060 GPU. Three changes significantly improved performance: disabling memory mapping (`--no-mmap`), leaving sufficient VRAM headroom, and closing CPU-intensive applications. Speculative decoding also gave a 26% speedup, which the user found surprising because it runs contrary to other benchmarks; the user attributes the gain to the model's hybrid architecture with CPU-offloaded experts.
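The speculative-decoding result can be sanity-checked against the standard expected-speedup estimate from the speculative decoding literature. The sketch below is not from the post: the acceptance rate `alpha`, draft length `gamma`, and `cost_ratio` (cost of one draft pass relative to one target pass) are hypothetical inputs, and the formula assumes that verifying `gamma + 1` tokens costs about as much as a single target pass, which may not hold for a mixture-of-experts model with CPU-offloaded experts.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification pass.

    alpha: probability that a drafted token is accepted (assumed i.i.d.).
    gamma: number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain autoregressive decoding.

    cost_ratio: time of one draft pass divided by time of one target pass
    (hypothetical input; measure it on your own hardware).
    """
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")
    # One step costs gamma draft passes plus one target pass.
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    # Illustrative values only, not measurements from the post.
    for ratio in (0.05, 0.2, 0.5):
        print(ratio, round(expected_speedup(alpha=0.6, gamma=4, cost_ratio=ratio), 3))
```

Under these assumptions, a slow target pass (for example, one bottlenecked by CPU-resident experts) makes a GPU-resident draft model relatively cheap, which lowers `cost_ratio` and raises the estimate. Whether that explains the reported 26% is the user's hypothesis, not something this estimate confirms.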

IMPACT Offers practical settings for running a 35B-parameter model on an 8GB laptop GPU, which may make local deployments on consumer hardware more accessible and efficient.

llama.cpp
Qwen3.6-35B-A3B
Qwen3.5-0.8B
RTX 4060 Laptop