PulseAugur

New UrduMMLU benchmark reveals LLM knowledge gaps

Researchers have introduced UrduMMLU, a benchmark for evaluating how well large language models understand Urdu. It comprises 26,431 multiple-choice questions across 26 subjects, drawn from native educational materials. In the reported evaluations, Gemini-3.5-Flash performs best, while many other models, particularly open-source ones, show significant knowledge gaps, especially in humanities and region-specific content.

IMPACT Highlights uneven Urdu language understanding in LLMs, particularly for region-specific content, guiding future model development.

RANK_REASON The cluster contains a research paper introducing a new benchmark for evaluating LLMs.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Ahmer Tabassum, Sarfraz Ahmad, Hasan Iqbal, Owais Aijaz, Momina Ahsan, Preslav Nakov

    UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

    arXiv:2606.07167v1. Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introdu…

  2. arXiv cs.CL TIER_1 English(EN) · Preslav Nakov

    UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

    Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs acros…