PulseAugur
New Mage framework evaluates LLM game scene generation beyond compile rates

Researchers have developed an evaluation framework called Mage for assessing how well Large Language Models (LLMs) generate executable game scenes. The framework goes beyond compile-pass rate to also measure runtime success, structural fidelity, and mechanism adherence across several LLMs and game concepts. The findings indicate that direct natural-language-to-code generation achieves high runtime success but often produces structurally unsound scenes, whereas conditioning on an intermediate representation improves structural integrity at the cost of runtime success.
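
To make the multi-axis idea concrete, here is a minimal Python sketch of how per-scene results on the four axes could be aggregated side by side. It is not the paper's implementation: the `SceneResult` schema, the `summarize` helper, the 0-to-1 score scale, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SceneResult:
    """Outcome of one generated scene (hypothetical schema, not from the paper)."""
    compiled: bool              # did the generated scene code compile?
    ran: bool                   # did the scene run without a runtime error?
    structural_fidelity: float  # assumed score in [0, 1]
    mechanism_adherence: float  # assumed score in [0, 1]


def summarize(results):
    """Report each axis separately instead of compile-pass rate alone."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "compile_pass_rate": sum(r.compiled for r in results) / n,
        "runtime_success_rate": sum(r.ran for r in results) / n,
        "mean_structural_fidelity": sum(r.structural_fidelity for r in results) / n,
        "mean_mechanism_adherence": sum(r.mechanism_adherence for r in results) / n,
    }


# Toy values chosen for illustration only; they are not results from the paper.
toy_results = [
    SceneResult(compiled=True, ran=True, structural_fidelity=0.5, mechanism_adherence=0.5),
    SceneResult(compiled=True, ran=False, structural_fidelity=1.0, mechanism_adherence=0.0),
]
for axis, value in summarize(toy_results).items():
    print(f"{axis}: {value:.2f}")
```

In this toy data every scene compiles, yet the other three axes differ, which is the kind of gap a compile-only metric would hide.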

Impact: Introduces a more robust evaluation method for LLM code generation, highlighting the limitations of current metrics and suggesting improvements for domain-specific tasks.

Ranking rationale: Academic paper introducing a new evaluation methodology for LLM code generation.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

Sources [1]

  1. arXiv cs.LG TIER_1 English(EN) · Kıvanç Tatar

    Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

    Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol (named `Mage') -- compile succ…