Researchers have developed QuestBench, a new benchmark designed to teach students about AI by having them construct and evaluate AI systems. This method encourages students to define what constitutes a trustworthy answer, moving beyond simply using AI as a productivity tool. The benchmark, comprising 256 questions across 14 humanities and social science domains, revealed significant failures in current AI systems, with the best performer, GPT-5.5, achieving only a 57.58% pass rate.
IMPACT: Highlights the limitations of current AI in complex knowledge domains and the need for better evaluation methods.
AI-generated summary · Google Gemini · from 2 sources.