Knowledge Index of Noah's Ark
A new benchmark, KINA, evaluates large language models across 261 fine-grained disciplines and addresses two issues in benchmark construction: scaling-driven design and annotation quality. Its 899 items are chosen for disciplinary representativeness, and a novel tournament system is used to improve review quality. Across 42 evaluated models, Gemini-3.1-Pro-Preview led with 53.17%, followed by Claude-Opus-4.6 and GPT-5.4. Even the top score leaves significant room for improvement.
IMPACT: Establishes a new, more rigorous benchmark for LLM evaluation, potentially driving improvements in model capabilities and disciplinary understanding.