PulseAugur
research · [1 source]

New ClawMark benchmark tests multimodal coworker agents in dynamic environments

Researchers have introduced ClawMark, a new benchmark designed to evaluate AI agents that function as persistent coworkers. Unlike previous benchmarks, which are often static and text-focused, ClawMark simulates a dynamic, multi-day work environment where information changes independently of the agent. The benchmark comprises 100 tasks across 13 professional scenarios, built on five stateful services and evaluated by over 1,500 deterministic checkers. In initial testing, the top-performing agent achieved a weighted score of 75.8, but strict task success was only 20.0%, indicating significant challenges in adapting to evolving environmental states.
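The gap between the weighted score and strict success follows from how the two metrics aggregate checker results: partial credit per checker versus all-or-nothing per task. A minimal sketch, assuming a simple unweighted-average scheme (the task names, checker outcomes, and scoring formula here are illustrative assumptions, not ClawMark's actual protocol):

```python
# Hypothetical per-task checker outcomes (True = checker passed).
# These tasks and results are invented for illustration only.
tasks = {
    "triage_inbox":    [True, True, True, False],
    "update_calendar": [True, True, True, True],
    "draft_report":    [True, False, True, False],
}

# Weighted score: mean fraction of checkers passed per task, scaled to 100.
weighted = 100 * sum(
    sum(checks) / len(checks) for checks in tasks.values()
) / len(tasks)

# Strict success: a task counts only if every one of its checkers passes.
strict = 100 * sum(all(checks) for checks in tasks.values()) / len(tasks)

print(f"weighted={weighted:.1f}  strict={strict:.1f}")
# → weighted=75.0  strict=33.3
```

An agent that gets most steps right on most tasks can score well on the weighted metric while fully completing very few tasks, which matches the 75.8 vs. 20.0% spread reported for the top agent.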

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT New benchmark challenges AI agents to adapt to dynamic, multi-day work environments, highlighting adaptation as a key research area.

RANK_REASON Introduces a new benchmark for evaluating AI agents in a simulated work environment.

Read on arXiv cs.CV →

COVERAGE [1]

  1. arXiv cs.CV TIER_1 · Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun ·

    ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

    arXiv:2604.23781v1 Announce Type: new Abstract: Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calenda…