A user on the r/LocalLLaMA subreddit is asking how to improve the inference speed of their local large language model setup. On a laptop with an RTX 5070 Ti GPU (12GB VRAM), 32GB of RAM, and an Intel Core Ultra 9 processor, they get only 37 tokens per second with the Qwen3.6-35B-A3B-Q6_K_P model. They have experimented with various llama.cpp command-line arguments, including different quantization levels and context sizes, without significant improvement.
IMPACT Users running local LLMs may face similar performance challenges and can learn from the advice shared in this discussion.
RANK_REASON User is asking for advice on a technical issue related to running a local LLM, which falls under commentary/discussion rather than a new release or significant event.
AI-generated summary · Google Gemini · from 1 source.