PulseAugur

User seeks to boost local LLM speed on high-end laptop

A user on the r/LocalLLaMA subreddit is asking how to improve the inference speed of their local large language model setup. Their laptop has an RTX 5070 Ti GPU (12GB VRAM), 32GB of RAM, and an Intel Core Ultra 9 275HX processor, and they run llama-server on Windows 11, yet they cannot get the Qwen3.6-35B-A3B-Q6_K_P model above 37 tokens per second. They have experimented with various llama.cpp command-line arguments, including different quantization levels and context sizes, but have not found significant improvements.
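For context on why the 12GB VRAM figure matters, here is a back-of-envelope sizing sketch. It is not from the original post. It assumes the "35B" in the model name is the total parameter count and uses the nominal 6.5625 bits per weight of llama.cpp's Q6_K format; the "_P" variant named in the post may differ, and the KV cache and compute buffers are ignored.

```python
# Rough sizing sketch; every constant here is an assumption, not a measurement.
PARAMS_TOTAL = 35e9        # assumed from the "35B" in the model name
BITS_PER_WEIGHT = 6.5625   # nominal Q6_K figure; the "_P" variant may differ
VRAM_GB = 12.0             # VRAM stated in the post


def weights_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


size_gb = weights_size_gb(PARAMS_TOTAL, BITS_PER_WEIGHT)
# Portion of the weights that cannot fit in VRAM under these assumptions.
overflow_gb = max(0.0, size_gb - VRAM_GB)

print(f"approx. weights: {size_gb:.1f} GB, outside VRAM: {overflow_gb:.1f} GB")
```

Under these assumptions the weights alone are well over twice the available VRAM, so a large share of the model would have to be served from system RAM, which is one plausible reason throughput plateaus.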

IMPACT Users running local LLMs may face similar performance challenges and can learn from the advice shared in this discussion.

RANK_REASON User is asking for advice on a technical issue related to running a local LLM, which falls under commentary/discussion rather than a new release or significant event.

Read on r/LocalLLaMA →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/KneelB4S8n

    How do I improve my T/S

    I have a laptop with 5070 Ti (12GB VRAM), 32GB of RAM, Intel Core Ultra 9 275HX and Windows 11, and I am using llama-server.

    I see people with 6 GB of VRAM running MoEs with 30-40 t/s but I cannot push my Qwen3.6-35B-A3B-Q6_K_P above 37 …