PulseAugur

New QuestBench benchmark reveals AI failures in humanities

Researchers have developed QuestBench, a benchmark built as a course-based practice in which students learn about AI by constructing the benchmark and testing AI systems against it. The approach asks students to define what constitutes a trustworthy answer, rather than treating AI only as a productivity tool. The benchmark comprises 256 questions across 14 humanities and social science domains and exposed significant failures in current AI systems: the best performer, GPT-5.5, achieved a pass rate of only 57.58%.

Impact: Highlights the limitations of current AI in complex knowledge domains and the need for better evaluation methods.

Ranking rationale: The cluster describes a new academic paper introducing a benchmark for evaluating AI systems.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · based on 2 sources.

Sources [2]

  1. arXiv cs.AI TIER_1 English(EN) · Haiyang Shen, Jiuzheng Wang, Taian Guo, Mugeng Liu, Wenchun Jing, Chongyang Pan, Siqi Zhong, Zhiyang Chen, Weichen Bi, Yudong Han, Xiaoying Bai, Yun Ma

    Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

    arXiv:2605.21413v2. Abstract: As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a set…

  2. arXiv cs.AI TIER_1 English(EN) · Yun Ma

    Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

    As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and unde…