Researchers have developed BLUEX v2, a new benchmark for evaluating Large Language Models (LLMs) on open-ended questions in Portuguese, drawn from the second-phase entrance exams of two of Brazil's top universities, UNICAMP and USP. The dataset contains 395 questions from 2022-2025, comprising 919 graded subquestions, with over half accompanied by images. When 21 state-of-the-art LLMs were tested, a performance gap of 4.92 points was observed, with mathematical reasoning and image understanding proving to be the most challenging areas for the models.
AI IMPACT This benchmark will help researchers better assess and improve LLM capabilities in Portuguese for complex, open-ended tasks.
RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating LLMs.
AI-generated summary · Google Gemini · from 1 source.