Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result
A user described running the Qwen3.6-35B-A3B model on a laptop with an 8GB RTX 4060 GPU. They found that disabling memory mapping (`--no-mmap`), leaving sufficient VRAM headroom, and closing CPU-intensive applications significantly improved performance. Surprisingly, speculative decoding gave a 26% speedup, contrary to other benchmarks; the user attributes this to the model's hybrid architecture with CPU-offloaded experts.
IMPACT Provides practical insights for running large language models on limited hardware, potentially improving accessibility and efficiency for local AI deployments.
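As a rough illustration of two quantities mentioned in the summary, VRAM headroom and relative speedup, here is a minimal Python sketch. It is not the user's method. The function names are hypothetical, and it assumes an NVIDIA driver whose `nvidia-smi` tool supports the `--query-gpu` CSV output. How much headroom counts as "sufficient" is not stated in the source, so no threshold is hard-coded.

```python
import subprocess


def query_nvidia_smi() -> str:
    """Return 'used, total' memory in MiB per GPU, one line each (requires nvidia-smi)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def vram_headroom_mib(smi_output: str) -> int:
    """Free VRAM in MiB for the first GPU listed in the CSV output."""
    first_line = smi_output.strip().splitlines()[0]
    used, total = (int(field.strip()) for field in first_line.split(","))
    return total - used


def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain in percent, given tokens/second before and after."""
    return (new_tps / baseline_tps - 1.0) * 100.0
```

On a machine with an NVIDIA GPU, `vram_headroom_mib(query_nvidia_smi())` reports the headroom before the model is loaded. For the speedup, a hypothetical pair of measurements such as 10.0 and 12.6 tokens/second corresponds to a gain of about 26%; these numbers are illustrative only and are not taken from the user's benchmarks.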