WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts
Researchers have developed WebRISE, a new benchmark for evaluating Multimodal Large Language Models (MLLMs) that generate web artifacts. Unlike previous methods, WebRISE focuses on requirement-induced states and transitions: it compiles each task's requirements into an Interaction Contract Graph (ICG). The benchmark includes 442 tasks across five input modalities. It shows that even top-performing MLLMs struggle with transition validity and requirement coverage, and that visual quality does not correlate with functional behavior.
IMPACT: This benchmark highlights current limitations of MLLMs in web generation and suggests areas for future model development and evaluation.
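To make the idea of compiling requirements into states and transitions more concrete, here is a minimal Python sketch of what a contract graph and two simple scores could look like. It is an illustration only, not WebRISE's actual data model or scoring: the class names, the login-form example, and both score definitions (`coverage` and `validity`) are assumptions introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transition:
    """One required or observed step: in `source`, doing `action` leads to `target`."""
    source: str
    action: str
    target: str


@dataclass
class ContractGraph:
    """Hypothetical stand-in for an Interaction Contract Graph."""
    states: set[str] = field(default_factory=set)
    transitions: set[Transition] = field(default_factory=set)

    def require(self, source: str, action: str, target: str) -> None:
        # Register both endpoint states and the required transition.
        self.states.update({source, target})
        self.transitions.add(Transition(source, action, target))


def score(graph: ContractGraph, observed: set[Transition]) -> tuple[float, float]:
    """Return (coverage, validity) under this sketch's own toy definitions.

    coverage: fraction of required transitions that were observed.
    validity: fraction of observed transitions that match a required one.
    """
    matched = graph.transitions & observed
    coverage = len(matched) / len(graph.transitions) if graph.transitions else 1.0
    validity = len(matched) / len(observed) if observed else 0.0
    return coverage, validity


# Hypothetical requirements for a login form.
graph = ContractGraph()
graph.require("logged_out", "submit_valid_credentials", "logged_in")
graph.require("logged_out", "submit_invalid_credentials", "error_shown")

# Hypothetical behavior of a generated page: the error state is never reached.
observed = {
    Transition("logged_out", "submit_valid_credentials", "logged_in"),
    Transition("logged_out", "submit_invalid_credentials", "logged_out"),
}

coverage, validity = score(graph, observed)
```

In this toy case the page could look correct in a screenshot while still missing a required state, which is the kind of gap between visual quality and functional behavior that the summary describes.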