PulseAugur
WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

New benchmark WebRISE tests web artifacts generated by multimodal large language models

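Researchers have developed WebRISE, a new benchmark for evaluating multimodal large language models (MLLMs) that generate web artifacts. Unlike earlier approaches, WebRISE focuses on requirement-induced states and transitions, compiling task requirements into Interaction Contract Graphs (ICGs). The benchmark comprises 442 tasks spanning five input modalities. It shows that even the best-performing MLLMs struggle with transition validity and requirement coverage, and that visual quality does not correlate with functional behavior.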
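The summary does not give the paper's actual ICG format or metric definitions, so the following is only an illustrative sketch of the general idea: a contract lists the state transitions that the requirements imply, and a generated page is scored by how many of them it honors. The `Transition` class, the `evaluate` function, the state and action names, and both scoring formulas are assumptions made for this example, not WebRISE's API or method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    source: str       # state before the user action
    action: str       # user action, e.g. "click:add"
    target: str       # state the requirement expects afterwards
    requirement: str  # id of the requirement that induced this transition


def evaluate(contract, observed):
    """Score observed page behavior against a (hypothetical) contract.

    contract: non-empty list of Transition, standing in for ICG edges
    observed: dict mapping (source, action) -> state actually reached
    Returns (transition_validity, requirement_coverage), both in [0, 1].
    """
    passed = [observed.get((t.source, t.action)) == t.target for t in contract]
    validity = sum(passed) / len(contract)

    # Assumption: a requirement is covered only if all its transitions pass.
    by_req = {}
    for t, ok in zip(contract, passed):
        by_req[t.requirement] = by_req.get(t.requirement, True) and ok
    coverage = sum(by_req.values()) / len(by_req)
    return validity, coverage


# Toy to-do page: two requirements, three expected transitions.
contract = [
    Transition("empty", "click:add", "one_item", "R1"),
    Transition("one_item", "click:add", "two_items", "R1"),
    Transition("one_item", "click:delete", "empty", "R2"),
]
# Simulated behavior of a generated page whose delete button does nothing.
observed = {
    ("empty", "click:add"): "one_item",
    ("one_item", "click:add"): "two_items",
    ("one_item", "click:delete"): "one_item",
}

validity, coverage = evaluate(contract, observed)
print(f"transition validity: {validity:.2f}, requirement coverage: {coverage:.2f}")
```

In this toy case the page can look finished while the delete requirement fails, which is the kind of gap between visual quality and functional behavior that the summary describes.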

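Impact: The benchmark highlights the current limitations of MLLMs in web generation and points to directions for future model development and evaluation.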

Ranking rationale: This cluster contains a research paper introducing a new benchmark for evaluating AI models.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

  1. arXiv cs.AI TIER_1 English(EN) · Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang ·

    WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

    arXiv:2606.03220v1 Announce Type: cross Abstract: Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task …

  2. arXiv cs.CL TIER_1 English(EN) · Yujiu Yang ·

    WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

    Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICG…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

    WebRISE evaluates MLLM-generated web artifacts by analyzing interaction contracts that capture user intent transitions and requirement checks across multiple input modalities, revealing significant gaps in model performance and demonstrating superior error detection compared to t…