Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems
Researchers have introduced Sci-Rho, a multilingual benchmark designed to test the robustness of vision-language models (VLMs) on STEM problems. The benchmark comprises over 4,200 problem templates across five subjects and seven languages, which generate more than 42,000 unique instances. Evaluations of 17 state-of-the-art VLMs revealed a significant gap between average and worst-case accuracy, with smaller models degrading more across languages than larger, proprietary models.
AI IMPACT: Highlights the need for more robust evaluation methods for VLMs, particularly across different languages and visual contexts.
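To make the gap between average and worst-case accuracy concrete, here is a minimal Python sketch. It rests on an assumption the summary does not confirm: that worst-case accuracy counts a template as solved only when every instance generated from it is answered correctly. The function name `average_and_worst_case` and the toy data are hypothetical and are not taken from the Sci-Rho paper.

```python
from collections import defaultdict


def average_and_worst_case(results):
    """Return (average accuracy, worst-case accuracy).

    results: iterable of (template_id, is_correct) pairs, one per instance.
    Assumption: a template is solved in the worst case only if all of its
    instances are correct. The benchmark's own definition may differ.
    """
    by_template = defaultdict(list)
    for template_id, is_correct in results:
        by_template[template_id].append(bool(is_correct))
    if not by_template:
        raise ValueError("no results given")

    # Average accuracy: fraction of all instances answered correctly.
    total = sum(len(v) for v in by_template.values())
    average = sum(sum(v) for v in by_template.values()) / total

    # Worst-case accuracy: fraction of templates with every instance correct.
    worst_case = sum(all(v) for v in by_template.values()) / len(by_template)
    return average, worst_case


# Toy data: two templates with three instances each (made up for illustration).
results = [
    ("t1", True), ("t1", True), ("t1", False),
    ("t2", True), ("t2", True), ("t2", True),
]
avg, worst = average_and_worst_case(results)
print(f"average={avg:.3f} worst_case={worst:.3f}")
```

In this toy case a single wrong variant leaves the average high while halving the worst-case score, which is the kind of gap the evaluation reports.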