DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks
Researchers have developed DailyReport, a new benchmark for evaluating search agents (SAs) on realistic, open-ended daily search tasks. Unlike previous benchmarks, which focused on specialized scenarios, DailyReport comprises 150 tasks with over 3,500 rubrics that reflect current user information needs. The benchmark evaluates each task through cascaded rubrics across different dimensions, which makes its scores interpretable. Initial tests on 17 agent systems indicate that current SAs do not yet meet user expectations.