GIM: Evaluating models via tasks that integrate multiple cognitive domains

GIM benchmark evaluates LLMs on integrated cognitive tasks

By the PulseAugur editorial team · [2 sources] · 2026-05-18 17:09

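Researchers have introduced the Grounded Integration Measure (GIM), a new benchmark that evaluates large language models on tasks integrating multiple cognitive domains. GIM comprises 820 original questions, each requiring the model to coordinate several cognitive operations over accessible knowledge. The goal is to assess reasoning grounded in realistic tasks rather than pure memorization or abstract reasoning. The benchmark includes a public-private split for contamination diagnostics, and it uses an item response theory (IRT) model calibrated on more than 200,000 prompt-response pairs from 28 models to produce robust ability estimates and a comprehensive leaderboard.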
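To make the IRT step concrete, here is a minimal sketch of how an ability estimate can be derived from right/wrong responses once item parameters are known. It assumes a standard two-parameter logistic (2PL) model; the summary does not say which IRT variant GIM uses, and the names `p_correct`, `estimate_ability`, and the item parameters below are hypothetical illustrations, not values from the paper.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer
    for ability theta, item discrimination a, item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items, lo=-6.0, hi=6.0, iters=60):
    """Maximum-likelihood ability estimate by bisection.

    responses: list of 0/1 outcomes, one per item.
    items: list of (a, b) pairs, assumed already calibrated, with a > 0.
    The derivative of the 2PL log-likelihood with respect to theta is
    sum(a * (y - P)), which decreases in theta, so bisection finds its root.
    All-correct or all-wrong response patterns run to the search bounds.
    """
    def score(theta):
        return sum(a * (y - p_correct(theta, a, b))
                   for y, (a, b) in zip(responses, items))

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid  # likelihood still increasing: ability is higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical item bank: (discrimination, difficulty) per question.
ITEMS = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0), (1.5, 2.0)]

strong = estimate_ability([1, 1, 1, 0], ITEMS)
weak = estimate_ability([1, 0, 0, 0], ITEMS)
print(f"strong model: {strong:.2f}, weak model: {weak:.2f}")
```

Because the estimate weighs each answer by the item's difficulty and discrimination, two models with the same raw accuracy can receive different ability scores, which is one reason a benchmark might prefer IRT over plain percent-correct for a leaderboard.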

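Impact: Introduces a new evaluation framework that goes beyond knowledge recall and abstract reasoning to test the integrated cognitive abilities of LLMs.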

Ranking rationale: This cluster describes an academic paper introducing a novel benchmark for evaluating AI models.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Steven McClain · 2026-05-18 17:09

GIM: Evaluating models via tasks that integrate multiple cognitive domains

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second d…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-18 17:09

GIM: Evaluating models via tasks that integrate multiple cognitive domains

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second d…
