PulseAugur

New BLUEX v2 benchmark tests LLMs on complex Portuguese university exam questions

Researchers have introduced BLUEX v2, a benchmark for evaluating Large Language Models (LLMs) on open-ended questions in Portuguese. The questions come from the second-phase entrance exams of two of Brazil's top universities, UNICAMP and USP. The dataset contains 395 questions from 2022-2025, comprising 919 graded subquestions, with over half accompanied by images. In tests of 21 state-of-the-art LLMs, a performance gap of 4.92 points was observed, and mathematical reasoning and image understanding proved the most challenging areas for the models.

IMPACT This benchmark will help researchers better assess and improve LLM capabilities in Portuguese for complex, open-ended tasks.

RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating LLMs.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CL TIER_1 English (EN) · Helio Pedrini

    BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams

    Although Large Language Models (LLMs) excel in many tasks, their assessment in Portuguese has received less attention, particularly for open-ended, discursive tasks that demand deeper reasoning and generation capabilities. While the original BLUEX benchmark addressed the scarcity…