Researchers have introduced ClawMark, a new benchmark designed to evaluate AI agents that function as persistent coworkers. Unlike previous benchmarks, which are often static and text-focused, ClawMark simulates a dynamic, multi-day work environment in which information changes independently of the agent. The benchmark comprises 100 tasks across 13 professional scenarios, built on five stateful services and evaluated by over 1,500 deterministic checkers. In initial testing, the top-performing agent achieved a weighted score of 75.8 but a strict task success rate of only 20.0%, indicating significant difficulty adapting to evolving environmental state.
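The gap between the weighted score and the strict success rate follows from how per-task checker results are aggregated. Here is a minimal sketch of that distinction; the task and checker structures are hypothetical, since the source does not describe ClawMark's actual evaluation harness:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of one task: each deterministic checker passes or fails.

    (Hypothetical structure; ClawMark's real result format is not given.)
    """
    checker_passed: list[bool]

def weighted_score(results: list[TaskResult]) -> float:
    """Partial credit: fraction of checkers passed per task, averaged (0-100)."""
    per_task = [sum(r.checker_passed) / len(r.checker_passed) for r in results]
    return 100 * sum(per_task) / len(per_task)

def strict_success(results: list[TaskResult]) -> float:
    """All-or-nothing: a task counts only if every checker passes (0-100)."""
    return 100 * sum(all(r.checker_passed) for r in results) / len(results)

# An agent can pass most checkers on most tasks (high weighted score)
# while rarely passing all of them (low strict success).
results = [
    TaskResult([True, True, True, False]),
    TaskResult([True, True, True, True]),
    TaskResult([True, True, False, False]),
]
print(f"weighted: {weighted_score(results):.1f}")  # 75.0
print(f"strict:   {strict_success(results):.1f}")  # 33.3
```

Under this kind of scoring, partial progress on many tasks inflates the weighted metric even when few tasks are completed end to end, which is consistent with the reported 75.8 versus 20.0% gap.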
IMPACT: A new benchmark challenges AI agents to adapt to dynamic, multi-day work environments, highlighting adaptation as a key research area.