WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
Researchers have introduced WebGameBench, a benchmark that evaluates whether coding agents can build functional browser-based games from specifications. It judges the delivered application rather than the source code alone, assessing whether an agent can turn a frozen specification into a playable game. Initial tests across 12 agents and 111 tasks show that the best agent achieved a 76.9% usable rate, yet only 20.2% were rated excellent, highlighting the gap between basic functionality and full requirement satisfaction.
AI IMPACT: Establishes a new evaluation standard for coding agents, pushing them beyond code generation to functional application delivery.