A new benchmark, HalBench, evaluates large language models (LLMs) on their ability to identify and push back against false premises rather than sycophantically agreeing with them. The latest version tested 29 open-source models alongside four proprietary models. Qwen-3.6, an open-source model, achieved the highest pushback percentage among the open-source models tested, outperforming larger models and even some proprietary ones, such as GPT-5.4 and Gemini 3.1 Pro.
AI IMPACT This benchmark highlights how widely LLMs vary in their ability to recognize and reject false premises, with Qwen-3.6 the strongest of the open-source models tested.
RANK_REASON The cluster describes a new benchmark for evaluating LLMs on sycophancy and hallucination, including results for multiple open-source models.
- Sonnet 4.6
- Gemini 3.1 Pro
- Gemma 4
- GPT-5.4
- Grok 4.3
- HalBench
- Meta
- open-source software
- phi-4
- Qwen-3.6
AI-generated summary · Google Gemini · from 1 source.