New benchmark tests coding agents on browser game creation

By PulseAugur Editorial · [1 source] · 2026-05-25 04:00

Researchers have introduced WebGameBench, a benchmark that evaluates whether coding agents can turn a frozen specification into a playable browser-based game. It assesses the delivered application rather than the source code alone. Initial tests across 12 agents and 111 tasks show that while the best agent achieved a 76.9% usable rate, only 20.2% were rated as excellent, highlighting the gap between basic functionality and full requirement satisfaction.

IMPACT Establishes a new evaluation standard for coding agents, pushing them beyond code generation to functional application delivery.

RANK_REASON The cluster describes a new academic benchmark for evaluating AI systems.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Wenyu Zhang, Guoliang You, Tianlun, Haotian Zhao, Tianshu Zhu, Haoran Wang, Xiaoxuan Tang, Mingyang Dai, Jingnan Gu, Daxiang Dong, Jianmin Wu · 2026-05-25 04:00

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

arXiv:2605.17637v2 · Abstract: Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a req…
