GIM: Evaluating models via tasks that integrate multiple cognitive domains
Researchers have introduced the Grounded Integration Measure (GIM), a new benchmark designed to evaluate large language models by integrating multiple cognitive domains. GIM comprises 820 original problems that require coordinating various cognitive operations over accessible knowledge, aiming to assess reasoning grounded in realistic tasks rather than pure memorization or abstract reasoning. The benchmark includes a public-private split for contamination diagnostics and utilizes an IRT model calibrated on over 200,000 prompt-response pairs from 28 models to generate robust ability estimates and a comprehensive leaderboard. AI
IMPACT Introduces a new evaluation framework that moves beyond knowledge recall and abstract reasoning to test integrated cognitive abilities in LLMs.