Brief · PulseAugur

TOOL · r/LocalLLaMA English(EN) · 9h

HalBench: 29 OSS models tested on a custom-built Sycophancy and Hallucination Benchmark, Qwen 3.6 and Gemma 4 punching far above their weight! (While Meta keeps proving they forgot how to spend their money...)

A new benchmark called HalBench evaluates whether large language models (LLMs) identify and push back against false premises rather than sycophantically agreeing with them. The latest version tests 29 open-source models alongside four proprietary ones. The open-source Qwen-3.6 achieved the highest pushback percentage of any open-source model tested, outperforming larger models and even some proprietary ones such as GPT-5.4 and Gemini 3.1 Pro.

IMPACT This benchmark highlights the varying ability of LLMs to discern and reject false information, with Qwen-3.6 showing strong performance among open-source options.
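
For readers unfamiliar with the metric, a pushback percentage can be read as the share of a model's replies that challenge a false premise instead of going along with it. The sketch below is only an illustration of that idea, not HalBench's actual scoring code: the `Judgement` record, the `pushback_rate` helper, and the model names and sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """One graded reply (hypothetical record, not HalBench's schema)."""
    model: str
    pushed_back: bool  # True if the reply challenged the false premise


def pushback_rate(judgements):
    """Return {model: percentage of replies that pushed back}."""
    totals = {}
    hits = {}
    for j in judgements:
        totals[j.model] = totals.get(j.model, 0) + 1
        hits[j.model] = hits.get(j.model, 0) + int(j.pushed_back)
    return {m: 100.0 * hits[m] / totals[m] for m in totals}


# Made-up sample data for illustration only; not benchmark results.
sample = [
    Judgement("model-a", True),
    Judgement("model-a", False),
    Judgement("model-b", True),
    Judgement("model-b", True),
]
rates = pushback_rate(sample)
print(rates)
```

How each reply gets labeled as pushback (human grading, a judge model, or rules) is not described in this brief, so the boolean label is taken as given here.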

Meta
GPT-5.4
Gemini 3.1 Pro
Sonnet 4.6
open-source software
Qwen-3.6
Grok 4.3
phi-4
Gemma 4
HalBench