UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
Researchers have developed UrduMMLU, a new benchmark for evaluating how well large language models understand Urdu. The benchmark consists of over 26,000 multiple-choice questions across 26 subjects, sourced from native educational materials. Evaluations show that Gemini-3.5-Flash leads in performance, while many other models, particularly open-source ones, exhibit significant knowledge gaps, especially in the humanities and in region-specific content.
AI IMPACT: Highlights uneven Urdu language understanding in LLMs, particularly for region-specific content, guiding future model development.