PulseAugur

New Sci-Rho benchmark tests VLM robustness on multilingual STEM problems

Researchers have introduced Sci-Rho, a multilingual benchmark for testing the robustness of vision-language models (VLMs) on STEM problems. The benchmark comprises over 4,200 problem templates across five subjects and seven languages, which generate more than 42,000 unique instances. Evaluations of 17 state-of-the-art VLMs revealed a significant gap between average and worst-case accuracy, and smaller models degraded more across languages than larger, proprietary models.
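
To make the distinction between average and worst-case accuracy concrete, here is a minimal Python sketch. The summary does not say how Sci-Rho defines either metric, so the definitions below are assumptions: average accuracy is taken as the fraction of generated instances answered correctly, and worst-case accuracy as the fraction of templates for which every generated instance is answered correctly. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def average_and_worst_case(results):
    """Compute (average, worst_case) accuracy from per-instance results.

    results: iterable of (template_id, is_correct) pairs, one pair per
    generated instance. Both metric definitions are assumptions made for
    illustration, not the benchmark's confirmed definitions.
    """
    per_template = defaultdict(list)
    for template_id, is_correct in results:
        per_template[template_id].append(bool(is_correct))
    if not per_template:
        raise ValueError("no results given")

    n_instances = sum(len(flags) for flags in per_template.values())
    # Average: share of all instances answered correctly.
    average = sum(sum(flags) for flags in per_template.values()) / n_instances
    # Worst case: a template counts only if all of its instances are correct.
    worst_case = sum(all(flags) for flags in per_template.values()) / len(per_template)
    return average, worst_case


# Hypothetical toy data: two templates with three instances each.
toy = [
    ("t1", True), ("t1", True), ("t1", False),
    ("t2", True), ("t2", True), ("t2", True),
]
avg, worst = average_and_worst_case(toy)
```

Under these assumed definitions, a single failed variant of template `t1` leaves the average high (5 of 6 instances correct) while halving the worst-case score (1 of 2 templates fully solved), which is the kind of gap the summary describes.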

IMPACT Highlights the need for more robust evaluation methods for VLMs, particularly across different languages and visual contexts.

RANK_REASON The cluster contains an academic paper detailing a new benchmark for evaluating AI models.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English (EN) · Muhammad Falensi Azmi, Ikhlasul Akmal Hanif, Vallerie Alexandra Putra, Adi Yeltay, Abdullah Mubarak, Fajri Koto

    Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

    arXiv:2606.08034v1 (cross-listed). Abstract: Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual …

  2. arXiv cs.CL TIER_1 English (EN) · Fajri Koto

    Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

    Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In th…