Researchers have developed UrduMMLU, a new benchmark for evaluating how well large language models understand Urdu. The benchmark comprises more than 26,000 multiple-choice questions across 26 subjects, sourced from native educational materials. In the reported evaluations, Gemini-3.5-Flash performs best, while many other models, particularly open-source ones, show significant knowledge gaps, especially in the humanities and in region-specific content.
AI IMPACT Highlights uneven Urdu language understanding in LLMs, particularly for region-specific content, which can guide future model development.
RANK_REASON The cluster contains a research paper introducing a new benchmark for evaluating LLMs.
AI-generated summary · Google Gemini · from 2 sources.