BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams

A new benchmark, BLUEX v2, evaluates large language models on complex Portuguese-language university exam questions

By the PulseAugur editorial team · [1 source] · 2026-06-21 23:45

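Researchers have developed BLUEX v2, a new benchmark for evaluating large language models (LLMs) on open-ended questions in Portuguese, drawn specifically from the second-phase entrance exams of two leading Brazilian universities, UNICAMP and USP. The dataset contains 395 questions from 2022-2025, comprising 919 graded sub-questions, and more than half of the questions include images. In tests of 21 state-of-the-art LLMs, a performance gap of 4.92 points was observed, with mathematical reasoning and image understanding proving to be the most challenging areas for the models.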

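Impact: The benchmark will help researchers better evaluate and improve the capabilities of large language models on complex, open-ended tasks in Portuguese.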

Ranking rationale: This cluster contains an academic paper introducing a new benchmark for evaluating large language models.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · Helio Pedrini · 2026-06-21 23:45

BLUEX v2: Benchmarking LLMs on Open-Ended Questions from Brazilian University Entrance Exams

Although Large Language Models (LLMs) excel in many tasks, their assessment in Portuguese has received less attention, particularly for open-ended, discursive tasks that demand deeper reasoning and generation capabilities. While the original BLUEX benchmark addressed the scarcity…
