Researchers have developed QuestBench, a new benchmark designed to teach students how to evaluate AI systems by having them construct verification tasks. This approach exposes students to the complexities of AI-era knowledge work, encouraging them to define what constitutes a trustworthy AI-generated answer. Evaluations on QuestBench, which covers 14 humanities and social science domains, revealed significant failure rates for current AI systems, with even the top performer, GPT-5.5, achieving only a 57.58% pass rate on student-designed questions.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights the limitations of current AI in nuanced knowledge domains, suggesting a need for improved evaluation methods beyond simple task completion.
RANK_REASON Academic paper introducing a new benchmark and evaluation methodology.