This paper introduces a novel three-step methodology for designing and reporting benchmarks of knowledge work performed by AI agents. The approach emphasizes aligning benchmark tasks with real-world work activities, specifying the testing environment, and scoring the actual work product. Drawing on occupational studies, it defines 18 distinct work activities and gives guidance on mapping tasks to those activities, detailing settings, and scoring artifacts, so that benchmark performance accurately reflects deployment capabilities. The paper demonstrates the framework through case studies in areas such as non-code deliverables, document analysis, and software engineering.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Establishes a framework for more realistic AI evaluation in knowledge work domains.
RANK_REASON Academic paper proposing a new methodology for AI benchmarking.