Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
Two new benchmarks, DRA-Bank and ADRA-Bank, have been released to evaluate deep research agents (DRAs). They assess DRAs on tasks that mimic the work of management consultants and academic researchers, moving beyond simple retrieval to planning, reasoning, and handling complex prompts with embedded cognitive traps. Early evaluations show that current frontier agents such as Claude Opus 4.6, OpenAI o3-deep-research, and Google Gemini 3.1 Pro struggle to meet acceptance thresholds, exhibiting distinct failure modes such as fabrication, error propagation, or inconsistent performance.
AI IMPACT: These benchmarks highlight the current limitations of AI agents in complex, real-world research tasks, guiding future development toward more robust reasoning and planning capabilities.