PulseAugur
tool · [1 source] ·

New benchmark tests coding agents on browser game creation

Researchers have introduced WebGameBench, a benchmark that evaluates whether coding agents can build functional browser-based games from specifications. It judges the delivered application rather than the source code alone, testing whether an agent can turn a frozen specification into a playable game. Initial tests across 12 agents and 111 tasks show that the best agent achieved a 76.9% usable rate, yet only 20.2% of results were rated excellent, highlighting the gap between basic functionality and full requirement satisfaction.

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Establishes a new evaluation standard for coding agents, pushing them beyond code generation to functional application delivery.

RANK_REASON The cluster describes a new academic benchmark for evaluating AI systems.


COVERAGE [1]

  1. arXiv cs.AI TIER_1 · Wenyu Zhang, Guoliang You, Tianlun, Haotian Zhao, Tianshu Zhu, Haoran Wang, Xiaoxuan Tang, Mingyang Dai, Jingnan Gu, Daxiang Dong, Jianmin Wu ·

    WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

    arXiv:2605.17637v2 Announce Type: replace Abstract: Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a req…