HalBench: 29 OSS models tested on a custom built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 scoring far above their weight! (While Meta keeps proving they forgot how to spend their money...)

HalBench benchmark shows Qwen-3.6 leading open-source LLMs in resisting false information

By PulseAugur Editorial · [1 source] · 2026-06-16 00:19

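A new benchmark called HalBench has been released to evaluate how well large language models (LLMs) identify and resist false premises instead of sycophantically agreeing with them. The latest release tested 29 open-source models and four proprietary models. Qwen-3.6, an open-source model, stood out: it achieved the highest resistance percentage of all the open-source models tested, and it outperformed larger models, including some proprietary ones such as GPT-5.4 and Gemini 3.1 Pro.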

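Impact: The benchmark highlights how much LLMs differ in their ability to recognize and reject false information, with Qwen-3.6 performing strongly among the open-source options.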

Ranking rationale: This cluster describes a new benchmark for evaluating sycophancy and hallucination in LLMs, including test results for multiple open-source models.


AI-generated summary · Google Gemini · from 1 source.


Sources [1]

r/LocalLLaMA · /u/Saraozte01 · 2026-06-16 00:19

HalBench: 29 OSS models tested on a custom built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 scoring far above their weight! (While Meta keeps proving they forgot how to spend their money...)

https://www.reddit.com/r/LocalLLaMA/comments/1u6y5l5/halbench_29_oss_models_tested_on_a_custom_built/

