Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle
A new benchmark series, AARR, has been introduced to evaluate the research capabilities of advanced AI agents. The first iteration, AARRI-Bench, tests agents on tasks requiring professionalism, thoroughness, and nuanced reasoning, aspects that current systems often miss. In experiments, even the top-performing agent, Mini-SWE-Agent with Claude Opus 4.7, achieved only a 68.3% success rate, highlighting the need for AI agents to better emulate human research behaviors.
AI IMPACT Highlights limitations in current AI agents' ability to perform nuanced scientific reasoning, indicating a need for further development beyond complex scaffolding.