PulseAugur

GIM benchmark evaluates LLMs on integrated cognitive tasks

Researchers have introduced the Grounded Integration Measure (GIM), a new benchmark that evaluates large language models on tasks integrating multiple cognitive domains. GIM comprises 820 original problems that require coordinating several cognitive operations over accessible knowledge, so it assesses reasoning grounded in realistic tasks rather than pure memorization or abstract reasoning. The benchmark includes a public-private split for contamination diagnostics and uses an item response theory (IRT) model, calibrated on over 200,000 prompt-response pairs from 28 models, to produce robust ability estimates and a leaderboard.
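To make the IRT step more concrete, here is a minimal sketch of how an ability estimate can be derived from graded responses. The summary does not say which IRT variant GIM uses, so the two-parameter logistic (2PL) form, the item parameters, and the function names `p_correct` and `estimate_ability` are all assumptions chosen for illustration, not the authors' method.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function (assumed form): probability that a model
    with ability theta answers an item with discrimination a and
    difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, steps=50):
    """Maximum-likelihood ability estimate via Newton-Raphson.

    responses: list of 0/1 outcomes, one per item.
    items: list of (a, b) pairs, assumed already calibrated.
    Note: if every response is 0 or every response is 1, the MLE is
    unbounded; the clamped step only keeps the loop finite in that case.
    """
    theta = 0.0
    for _ in range(steps):
        grad = 0.0  # first derivative of the log-likelihood
        info = 0.0  # negative second derivative (Fisher information)
        for y, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            grad += a * (y - p)
            info += a * a * p * (1.0 - p)
        if info < 1e-12:
            break
        step = max(-1.0, min(1.0, grad / info))  # clamp for stability
        theta += step
        if abs(step) < 1e-8:
            break
    return theta


# Hypothetical item parameters (discrimination, difficulty).
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.5, 0.5)]
strong = estimate_ability([1, 1, 1, 0], items)
weak = estimate_ability([1, 0, 0, 0], items)
print(f"strong model ability: {strong:.2f}, weak model ability: {weak:.2f}")
```

In a full calibration the item parameters would be fit jointly with the abilities over all prompt-response pairs; fixing them here keeps the example short.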

Impact: Introduces a new evaluation framework that moves beyond knowledge recall and abstract reasoning to test integrated cognitive abilities in LLMs.

Ranking rationale: The cluster covers a new academic paper introducing a benchmark for evaluating AI models.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.AI · TIER_1 · English (EN) · Steven McClain

    GIM: Evaluating models via tasks that integrate multiple cognitive domains

    As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second d…

  2. Hugging Face Daily Papers · TIER_1 · English (EN)

    GIM: Evaluating models via tasks that integrate multiple cognitive domains

    As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second d…