PulseAugur

HalBench benchmark reveals Qwen-3.6 leads open-source LLMs in resisting falsehoods

HalBench is a new benchmark that evaluates whether Large Language Models (LLMs) identify and push back against false premises instead of sycophantically agreeing with them. The latest version tests 29 open-source models alongside four proprietary models. Qwen-3.6, an open-source model, achieved the highest pushback percentage among the open-source models tested, outperforming larger models and some proprietary ones, including GPT-5.4 and Gemini 3.1 Pro.

IMPACT This benchmark highlights the varying ability of LLMs to discern and reject false information, with Qwen-3.6 showing strong performance among open-source options.

RANK_REASON The cluster describes a new benchmark for evaluating LLMs on sycophancy and hallucination, including results for multiple open-source models.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 English(EN) · /u/Saraozte01

    HalBench: 29 OSS models tested on a custom built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 scoring far above their weight! (While Meta keeps proving they forgot how to spend their money...)

    https://www.reddit.com/r/LocalLLaMA/comments/1u6y5l5/halbench_29_oss_models_tested_on_a_custom_built/