Brief · PulseAugur

TOOL · r/LocalLLaMA English(EN) · 7h

Llama bench and real performance way different (Help)

A user on Reddit's r/LocalLLaMA subreddit reports a large gap between benchmarked performance and real-world speed for the Qwen 3.6-35B-A3B IQ4_XS model. The benchmark shows high tokens-per-second rates for both prompt evaluation and generation, but actual usage is much slower: prompt evaluation runs at 7.79 ms per token (128.30 tokens/sec) and generation at 125.31 ms per token (7.98 tokens/sec). The setup is an NVIDIA GeForce RTX 4060 Laptop GPU with 8GB VRAM and 16GB RAM, running a specific llama server configuration, and the user is asking for help identifying misconfigurations or other issues.
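The two units quoted in the post describe the same measurement: tokens per second is 1000 divided by milliseconds per token. The sketch below only redoes that arithmetic on the figures quoted above. The function and variable names are illustrative assumptions, not part of any llama tooling, and the result for the prompt figure can differ slightly from the quoted 128.30 because the reported milliseconds value is rounded.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to a throughput in tokens/sec."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


# Figures quoted in the post (real-world usage, not the benchmark run).
prompt_ms = 7.79        # prompt evaluation, ms per token
generation_ms = 125.31  # generation, ms per token

prompt_tps = tokens_per_second(prompt_ms)
generation_tps = tokens_per_second(generation_ms)

# How many times slower generation is than prompt evaluation, per token.
slowdown = generation_ms / prompt_ms

print(f"prompt eval: {prompt_tps:.2f} tokens/sec")
print(f"generation:  {generation_tps:.2f} tokens/sec")
print(f"generation is about {slowdown:.1f}x slower per token than prompt eval")
```

Running the same conversion on the benchmark numbers would make the size of the gap explicit, but those figures are not given in this brief.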

IMPACT Highlights potential issues in local LLM deployment and performance tuning.

Llama
NVIDIA GeForce RTX 4060 Laptop GPU
Qwen 3.6-35B-A3B IQ4_XS