Llama bench and real performance wayy different(Help)

User reports a significant gap between Llama benchmark results and real-world performance

By the PulseAugur editorial team · [1 source] · 2026-06-18 10:25

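A user on Reddit's r/LocalLLaMA ran into a significant gap between benchmark performance and actual generation speed when running the Qwen 3.6-35B-A3B IQ4_XS model. Although the benchmark reported high tokens-per-second figures for both prompt evaluation and generation, real-world use was much slower: prompt evaluation took 7.79 ms per token (128.30 tokens/s) and generation took 125.31 ms per token (7.98 tokens/s). The user is asking for help identifying a misconfiguration or other problem in their setup, which consists of an NVIDIA GeForce RTX 4060 Laptop GPU with 8 GB of VRAM, 16 GB of RAM, and a specific llama server configuration.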
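The two pairs of figures in the summary are the same measurement expressed in two units: throughput in tokens per second is 1000 divided by the latency in milliseconds per token. A minimal sketch of that conversion follows; the helper name `tokens_per_second` is hypothetical and is not part of any llama tooling.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


# Latencies quoted in the summary (rounded to two decimals in the source).
prompt_eval_tps = tokens_per_second(7.79)
generation_tps = tokens_per_second(125.31)

print(f"prompt eval: {prompt_eval_tps:.2f} tokens/s")
print(f"generation:  {generation_tps:.2f} tokens/s")
```

Run as written, this gives roughly 128.37 tokens/s for prompt evaluation and 7.98 tokens/s for generation. The small difference from the reported 128.30 tokens/s is presumably because the 7.79 ms figure is itself rounded.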

Impact: Highlights problems that can arise in local LLM deployment and performance tuning.

Ranking rationale: A user question about performance tuning for a specific LLM on local hardware.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

r/LocalLLaMA TIER_1 English(EN) · /u/Ok-Health-7096 · 2026-06-18 10:25

Llama bench and real performance wayy different(Help)

I had been using qwen 3.6-35b-a3b iq3xxs for past couple of days at 900tk/s prefil and ~40tk/s gen but it hallucinated alot would get facts wrong and what not. I decided to switch to iq4xs for better accuracy and thought even if I get 25tk/s it w…

