PulseAugur

QuestBench benchmark reveals AI failures in humanities and social sciences

Researchers have developed QuestBench, a benchmark designed to teach students how to evaluate AI systems by having them construct verification tasks. The approach exposes students to the complexities of AI-era knowledge work and asks them to define what makes an AI-generated answer trustworthy. Evaluations on QuestBench, which covers 14 humanities and social science domains, revealed high failure rates for current AI systems: even the top performer, GPT-5.5, passed only 57.58% of student-designed questions.

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Highlights the limitations of current AI in nuanced knowledge domains, suggesting a need for improved evaluation methods beyond simple task completion.

RANK_REASON Academic paper introducing a new benchmark and evaluation methodology.


COVERAGE [1]

  1. arXiv cs.AI TIER_1 · Yun Ma

    Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

    As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and unde…