Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

New ParaEval Framework Improves LLM Knowledge Evaluation

By the PulseAugur editorial team · [2 sources] · 2026-06-09 10:05

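Researchers have developed ParaEval, a framework intended to improve how large language models are evaluated. Current multiple-choice question answering (MCQA) benchmarks are overly sensitive to the exact wording of the answer options, so they give an inaccurate picture of what a model actually knows. ParaEval addresses this by querying the model with multiple paraphrased versions of the answer options, which yields a more robust measure of capability rather than a measure of familiarity with particular phrasings.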
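To make the idea in the summary above concrete, here is a minimal sketch of scoring each answer option over several paraphrases instead of a single surface form. Everything in it is an assumption made for illustration: the `Scorer` callable, the function names, the toy log-likelihood values, and in particular the log-mean-exp aggregation, which the sources do not specify and which may differ from what ParaEval actually does.

```python
import math
from typing import Callable, Sequence

# Hypothetical scorer type: returns log p(answer_text | question) under some model.
Scorer = Callable[[str, str], float]


def logmeanexp(values: Sequence[float]) -> float:
    """Numerically stable log of the mean of exp(values)."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))


def pick_single_form(score: Scorer, question: str, options: Sequence[str]) -> int:
    """Standard MCQA scoring: one surface form per option, return the argmax index."""
    scores = [score(question, option) for option in options]
    return max(range(len(scores)), key=scores.__getitem__)


def pick_with_paraphrases(
    score: Scorer, question: str, paraphrases: Sequence[Sequence[str]]
) -> int:
    """Score every paraphrase of each option, aggregate per option, return the argmax.

    Aggregating with log-mean-exp is an assumption of this sketch,
    not a method confirmed by the sources.
    """
    aggregated = [
        logmeanexp([score(question, form) for form in forms]) for forms in paraphrases
    ]
    return max(range(len(aggregated)), key=aggregated.__getitem__)


# Made-up log-likelihoods standing in for a real language model.
TOY_SCORES = {
    "Paris": -4.0,
    "The city of Paris": -1.5,
    "Paris, France": -1.0,
    "Lyon": -2.0,
    "The city of Lyon": -5.0,
    "Lyon, France": -6.0,
}


def toy_score(question: str, answer: str) -> float:
    """Toy scorer: looks up a fixed value and ignores the question."""
    return TOY_SCORES[answer]


QUESTION = "What is the capital of France?"
SINGLE_FORMS = ["Paris", "Lyon"]
PARAPHRASED_FORMS = [
    ["Paris", "The city of Paris", "Paris, France"],
    ["Lyon", "The city of Lyon", "Lyon, France"],
]

single_choice = pick_single_form(toy_score, QUESTION, SINGLE_FORMS)
paraphrase_choice = pick_with_paraphrases(toy_score, QUESTION, PARAPHRASED_FORMS)
```

With these toy numbers the single-form scorer prefers "Lyon" because that one phrasing happens to score well, while the aggregated scorer prefers "Paris". This is the kind of phrasing-driven flip the summary describes, shown here on invented data only.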

Impact: Offers a more reliable way to assess LLM knowledge, which could lead to more accurate model development and comparison.

Ranking rationale: This cluster contains an academic paper proposing a new evaluation method for LLMs.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · João Maria Janeiro, Mathurin Videau, Andrea Caciolai, Benjamin Piwowarski, Patrick Gallinari, Loic Barrault · 2026-06-10 04:00

Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

arXiv:2606.10657v1 · Abstract: Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact …

arXiv cs.CL TIER_1 English(EN) · Loic Barrault · 2026-06-09 10:05

Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact phrasing (surface form) of the answers, conflati…