Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 6h

Knowledge Index of Noah's Ark

KINA is a new benchmark for evaluating large language models across 261 fine-grained disciplines, built to address concerns about scaling-driven benchmark design and annotation quality. It comprises 899 items, aims for disciplinary representativeness, and uses a novel tournament system to improve review quality. Across 42 evaluated models, Gemini-3.1-Pro-Preview scored highest at 53.17%, followed by Claude-Opus-4.6 and GPT-5.4; with the top score only slightly above half, there is significant room for improvement.

IMPACT Introduces a benchmark designed for more rigorous LLM evaluation, which could drive improvements in model capabilities and disciplinary understanding.

Anthropic
OpenAI
Google
Gemini-3.1-Pro-Preview
GPT-5.4
Claude-Opus-4.6
KINA