Researchers have developed QuestBench, a new benchmark designed to teach students about AI by having them construct and evaluate AI systems. This method encourages students to define what constitutes a trustworthy answer, moving beyond simply using AI as a productivity tool. The benchmark, comprising 256 questions across 14 humanities and social science domains, revealed significant failures in current AI systems, with the best performer, GPT-5.5, achieving only a 57.58% pass rate.
Impact: Highlights the limitations of current AI in complex knowledge domains, emphasizing the need for better evaluation methods.
Ranking rationale: The cluster describes a new academic paper introducing a novel benchmark for evaluating AI systems.
AI-generated summary · Google Gemini · from 2 sources.