PulseAugur

New benchmark WebRISE tests MLLM-generated web artifacts

Researchers have introduced WebRISE, a benchmark for evaluating web artifacts generated by multimodal large language models (MLLMs). Earlier benchmarks assess interaction through local evidence; WebRISE instead targets the requirement-induced states and transitions that determine whether a page works, compiling task requirements into Interaction Contract Graphs (ICGs). The benchmark comprises 442 tasks across five input modalities. Results show that even top-performing MLLMs struggle with transition validity and requirement coverage, and that visual quality does not correlate with functional behavior.
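
The summary does not say how an ICG is represented or scored. As a purely illustrative sketch, assume an ICG can be modeled as a map from (state, action) pairs to expected next states. The class `ContractGraph`, the function `evaluate`, the state and action names, and both metric definitions below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ContractGraph:
    """Hypothetical stand-in for an Interaction Contract Graph (not the paper's format)."""

    # Maps (source_state, action) to the state the requirements say should follow.
    transitions: dict = field(default_factory=dict)

    def require(self, source: str, action: str, target: str) -> None:
        """Record one required transition."""
        self.transitions[(source, action)] = target


def evaluate(graph: ContractGraph, observed: list) -> tuple:
    """Score observed (source, action, target) triples against the contract.

    Assumed definitions, for illustration only:
      coverage = share of required transitions that were exercised at all
      validity = share of exercised required transitions that reached the expected state
    """
    exercised = {}
    for source, action, target in observed:
        if (source, action) in graph.transitions:
            exercised[(source, action)] = target

    required = len(graph.transitions)
    coverage = len(exercised) / required if required else 0.0
    valid = sum(1 for key, target in exercised.items() if graph.transitions[key] == target)
    validity = valid / len(exercised) if exercised else 0.0
    return coverage, validity


# Made-up shopping-cart requirements.
graph = ContractGraph()
graph.require("cart_empty", "click_add", "cart_has_item")
graph.require("cart_has_item", "click_remove", "cart_empty")
graph.require("cart_has_item", "click_checkout", "checkout_form")

# Made-up observations from driving a generated page.
observed = [
    ("cart_empty", "click_add", "cart_has_item"),
    ("cart_has_item", "click_remove", "cart_has_item"),  # page failed to update
]

coverage, validity = evaluate(graph, observed)
print(f"coverage={coverage:.2f} validity={validity:.2f}")
```

In this toy case the page could look correct in a screenshot while still failing the remove transition, which is the kind of gap between visual quality and functional behavior the summary describes.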

IMPACT The benchmark exposes functional weaknesses in MLLM web generation, notably in transition validity and requirement coverage, and points to targets for future model development and evaluation.

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.AI TIER_1 English(EN) · Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang

    WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

    arXiv:2606.03220v1. Abstract: Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task …

  2. arXiv cs.CL TIER_1 English(EN) · Yujiu Yang

    WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

    Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICG…

  3. Hugging Face Daily Papers TIER_1 English(EN)

    WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

    WebRISE evaluates MLLM-generated web artifacts by analyzing interaction contracts that capture user intent transitions and requirement checks across multiple input modalities, revealing significant gaps in model performance and demonstrating superior error detection compared to t…