Researchers have introduced BLUEX v2, an updated benchmark designed to evaluate Large Language Models (LLMs) on open-ended questions from Brazilian university entrance exams. This new version expands upon the original BLUEX by incorporating questions from the second-phase exams of UNICAMP and USP, which require free-form written responses. The dataset includes 395 questions with associated images, subject areas, reference answers, and cognitive capability tags, and has been used to test 21 state-of-the-art LLMs. Results indicate a performance spread of nearly 5 points on a 0-10 scale across models, with mathematical reasoning and image understanding proving to be the most challenging areas for current models.
AI IMPACT: This benchmark will help researchers better understand and improve LLM performance on complex, open-ended tasks in Portuguese, particularly in academic contexts.
AI-generated summary · Google Gemini · from 1 source.