A new benchmark called KINA has been developed to evaluate large language models across 261 fine-grained disciplines, addressing issues of scaling-driven design and annotation quality. The benchmark, comprising 899 items, aims for disciplinary representativeness and improved review quality through a novel tournament system. In evaluations of 42 models, Gemini-3.1-Pro-Preview led with 53.17%, followed by Claude-Opus-4.6 and GPT-5.4, indicating significant room for improvement.
AI IMPACT Establishes a new, more rigorous benchmark for LLM evaluation, potentially driving improvements in model capabilities and disciplinary understanding.
RANK_REASON The cluster contains a new academic paper introducing a novel benchmark for evaluating LLMs.
AI-generated summary · Google Gemini · from 1 source.