PulseAugur

User reports major performance gap between llama-bench results and real-world use

A user on Reddit's r/LocalLLaMA subreddit reports a large gap between benchmarked performance and real-world generation speed for the Qwen 3.6-35B-A3B model at IQ4_XS quantization. Benchmarks indicate high tokens-per-second rates for both prompt evaluation and generation, but actual usage is much slower: prompt evaluation runs at 7.79 ms per token (128.30 tokens/sec) and generation at 125.31 ms per token (7.98 tokens/sec). The user is asking for help identifying misconfigurations or other issues in their setup, which consists of an NVIDIA GeForce RTX 4060 Laptop GPU with 8GB VRAM, 16GB RAM, and a specific llama server configuration.
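The per-token latencies and tokens-per-second figures quoted in the summary are two views of the same measurement, and the arithmetic can be checked with a short sketch. The helper name `tokens_per_second` is hypothetical and introduced only for illustration; the input values are the ones reported above.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput in tokens/sec."""
    return 1000.0 / ms_per_token


# Latencies reported in the summary (real-world use, not the benchmark run).
prompt_tps = tokens_per_second(7.79)    # prompt evaluation
gen_tps = tokens_per_second(125.31)     # token generation

print(f"prompt eval: {prompt_tps:.2f} tokens/sec")
print(f"generation:  {gen_tps:.2f} tokens/sec")
```

The generation figure reproduces the reported 7.98 tokens/sec. The prompt figure lands within a fraction of a token per second of the reported 128.30, which is consistent with the per-token latency having been rounded before it was printed.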

IMPACT Highlights potential issues in local LLM deployment and performance tuning.

RANK_REASON User query about performance tuning for a specific LLM on local hardware.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/Ok-Health-7096

    Llama bench and real performance wayy different(Help)

    I had been using qwen 3.6-35b-a3b iq3xxs for past couple of days at 900tk/s prefil and ~40tk/s gen but it hallucinated alot would get facts wrong and what not. I decided to switch to iq4xs for better accuracy and thought even if I get 25tk/s it w…