PulseAugur

New BLUEX v2 benchmark evaluates LLMs on complex Brazilian university exam questions

Researchers have introduced BLUEX v2, an updated benchmark for evaluating Large Language Models (LLMs) on open-ended questions from Brazilian university entrance exams. The new version extends the original BLUEX with questions from the second-phase exams of UNICAMP and USP, which require free-form written responses. The dataset includes 395 questions with associated images, subject areas, reference answers, and cognitive capability tags, and has been used to test 21 state-of-the-art LLMs. Results show a performance spread of nearly 5 points on a 0-10 scale, with mathematical reasoning and image understanding proving the most challenging areas for current models.

IMPACT This benchmark will help researchers better understand and improve LLM performance on complex, open-ended tasks in Portuguese, particularly in academic contexts.

RANK_REASON The cluster describes a new academic benchmark and dataset for evaluating LLMs, which falls under research.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CL TIER_1 English(EN) · João Guilherme Alves Santos, Giovana Kerche Bonás, Thiago Laitz, Thales Sales Almeida, Helio Pedrini

    BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams

    arXiv:2606.22723v2. Abstract: Although Large Language Models (LLMs) excel in many tasks, their assessment in Portuguese has received less attention, particularly for open-ended, discursive tasks that demand deeper reasoning and generation capabilities. While…