PulseAugur

New Sci-Rho benchmark tests VLM robustness on multilingual STEM problems

Researchers have introduced Sci-Rho, a multilingual benchmark for testing the robustness of vision-language models (VLMs) on STEM problems. The benchmark comprises more than 4,200 problem templates spanning five subjects and seven languages, which together generate more than 42,000 unique instances. Evaluations of 17 state-of-the-art VLMs revealed a significant gap between average and worst-case accuracy, with smaller models degrading more across languages than larger, proprietary models.
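To make the gap between average and worst-case accuracy concrete, here is a minimal Python sketch. It rests on assumptions that the summary does not confirm: the function name, the toy data, and in particular the definition of worst-case accuracy (a template counts as solved only if every instance generated from it is answered correctly) are illustrative and may differ from the metric the Sci-Rho authors use.

```python
from collections import defaultdict


def average_and_worst_case(results):
    """Compute average and worst-case accuracy from per-instance results.

    `results` is an iterable of (template_id, is_correct) pairs, one pair
    per generated instance. ASSUMPTION: worst-case accuracy is the share of
    templates whose instances are ALL answered correctly; the paper may
    define it differently.
    """
    per_template = defaultdict(list)
    for template_id, is_correct in results:
        per_template[template_id].append(bool(is_correct))
    if not per_template:
        raise ValueError("results must contain at least one instance")

    total_instances = sum(len(flags) for flags in per_template.values())
    total_correct = sum(sum(flags) for flags in per_template.values())
    average = total_correct / total_instances

    # A template is robustly solved only if no variant of it fails.
    solved_templates = sum(all(flags) for flags in per_template.values())
    worst_case = solved_templates / len(per_template)
    return average, worst_case


# Hypothetical toy data: three templates with two instances each.
toy_results = [
    ("t1", True), ("t1", True),    # solved on every variant
    ("t2", True), ("t2", False),   # solved on only one variant
    ("t3", False), ("t3", False),  # never solved
]
avg, worst = average_and_worst_case(toy_results)
print(f"average={avg:.3f} worst_case={worst:.3f} gap={avg - worst:.3f}")
```

Under this definition a model that answers half of all instances correctly can still be robust on only a third of the templates, which is the kind of gap the evaluation reports.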

Impact: Highlights the need for more robust evaluation methods for VLMs, particularly across different languages and visual contexts.

Ranking rationale: The cluster contains an academic paper detailing a new benchmark for evaluating AI models.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.AI TIER_1 English(EN) · Muhammad Falensi Azmi, Ikhlasul Akmal Hanif, Vallerie Alexandra Putra, Adi Yeltay, Abdullah Mubarak, Fajri Koto

    Sci-Rho: A Multilingual, Visually Grounded Benchmark of Symbolic STEM Problems

    arXiv:2606.08034v1 Announce Type: cross Abstract: Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual …

  2. arXiv cs.CL TIER_1 English(EN) · Fajri Koto

    Sci-Rho: A Multilingual, Visually Grounded Benchmark of Symbolic STEM Problems

    Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In th…