Researchers have developed two new benchmarks to evaluate how well multimodal large language models (MLLMs) generate code for complex, interactive webpages. The first, WebIGBench, focuses on real-world websites and assesses code generation for dynamic user interactions such as clicks and inputs. The second, I-WebGenBench, targets the conversion of scientific research papers into executable interactive web systems, evaluating the models' capacity to handle dynamic mechanisms and state transitions.
AI IMPACT These benchmarks could drive improvements in the ability of MLLMs to create functional, interactive web applications and systems from varied inputs.
RANK_REASON The cluster contains two new academic papers introducing benchmarks for evaluating the code generation capabilities of MLLMs.
AI-generated summary · Google Gemini · from 2 sources.