PulseAugur
EN
LIVE 19:39:25

New QuestBench benchmark reveals AI failures in humanities

Researchers have developed QuestBench, a new benchmark designed to teach students about AI by having them construct and evaluate AI systems. This method encourages students to define what constitutes a trustworthy answer, moving beyond simply using AI as a productivity tool. The benchmark, comprising 256 questions across 14 humanities and social science domains, revealed significant failures in current AI systems, with the best performer, GPT-5.5, achieving only a 57.58% pass rate. AI

IMPACT Highlights the limitations of current AI in complex knowledge domains, emphasizing the need for better evaluation methods.

RANK_REASON The cluster describes a new academic paper introducing a novel benchmark for evaluating AI systems.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 2 sources. How we write summaries →

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Haiyang Shen, Jiuzheng Wang, Taian Guo, Mugeng Liu, Wenchun Jing, Chongyang Pan, Siqi Zhong, Zhiyang Chen, Weichen Bi, Yudong Han, Xiaoying Bai, Yun Ma ·

    Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

    arXiv:2605.21413v2 Announce Type: new Abstract: As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a set…

  2. arXiv cs.AI TIER_1 English(EN) · Yun Ma ·

    Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

    As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and unde…