Researchers have developed a new evaluation framework called Mage for assessing Large Language Models (LLMs) on generating executable game scenes. The framework moves beyond simple compile-pass rates to also measure runtime success, structural fidelity, and mechanism adherence across a range of LLMs and game concepts. The findings indicate that direct natural-language-to-code generation achieves high runtime success but often produces structurally unsound scenes, whereas conditioning on an intermediate representation improves structural integrity at the cost of runtime success.
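The trade-off described above is easiest to see as a set of per-scene pass/fail checks rolled up into aggregate rates. The sketch below is a minimal illustration of that metric structure, not the paper's actual harness; `SceneResult` and its fields are hypothetical names standing in for whatever concrete checks Mage runs.

```python
from dataclasses import dataclass

@dataclass
class SceneResult:
    """Outcome of evaluating one generated game scene (hypothetical schema)."""
    compiles: bool       # generated code compiles / parses
    runs: bool           # scene executes without runtime errors
    structure_ok: bool   # scene graph matches the intended layout
    mechanics_ok: bool   # game mechanisms behave as specified

def aggregate(results: list[SceneResult]) -> dict[str, float]:
    """Roll per-scene booleans up into the four rates the summary names."""
    n = len(results)
    if n == 0:
        return {}
    return {
        "compile_pass":        sum(r.compiles for r in results) / n,
        "runtime_success":     sum(r.runs for r in results) / n,
        "structural_fidelity": sum(r.structure_ok for r in results) / n,
        "mechanism_adherence": sum(r.mechanics_ok for r in results) / n,
    }

# Illustrative only: direct NL-to-code scenes that run but break structure
# would score high on runtime_success and low on structural_fidelity.
direct = [SceneResult(True, True, False, True), SceneResult(True, True, False, False)]
print(aggregate(direct))
```

Separating the checks this way is what lets the framework expose failure modes that a compile-pass rate alone would hide.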
IMPACT: Introduces a more robust evaluation method for LLM code generation, highlighting limitations of current metrics and suggesting improvements for domain-specific tasks.
RANK_REASON: Academic paper introducing a new evaluation methodology for LLM code generation.