Researchers have introduced Sci-Rho, a new multilingual benchmark designed to test the robustness of vision-language models (VLMs) on STEM problems. The benchmark includes over 4,200 problem templates across five subjects and seven languages, generating more than 42,000 unique instances. Evaluations of 17 state-of-the-art VLMs revealed a significant gap between average and worst-case accuracy, with smaller models degrading more across languages than larger, proprietary models.
Impact: Highlights the need for more robust evaluation methods for VLMs, particularly across different languages and visual contexts.
Ranking rationale: The cluster contains an academic paper detailing a new benchmark for evaluating AI models.
AI-generated summary · Google Gemini · from 2 sources.