PulseAugur

New KINA benchmark tests LLMs across 261 disciplines

A new benchmark called KINA evaluates large language models across 261 fine-grained disciplines. It targets two weaknesses of existing knowledge benchmarks: scaling-driven design and uneven annotation quality. The benchmark comprises 899 items, aims for disciplinary representativeness, and uses a novel tournament system to improve review quality. Across 42 evaluated models, Gemini-3.1-Pro-Preview led with 53.17%, followed by Claude-Opus-4.6 and GPT-5.4; with even the top model scoring just over 53%, there is significant room for improvement.

IMPACT Establishes a new, more rigorous benchmark for LLM evaluation, potentially driving improvements in model capabilities and disciplinary understanding.

RANK_REASON The cluster contains a new academic paper introducing a novel benchmark for evaluating LLMs.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Sheng Jin, Minghao Liu, Yunze Xiao, Zeqi Zhou, Heli Qi, Yifan Yao, Meishu Song, Kaijing Ma, Xuan Zhang, Sicong Jiang, Yizhe Li, Ningshan Ma, Jie Wei, Ziniu Li, Minglai Yang, Bangya Liu, Yiming Liang, Xiao Fang, Qingcheng Zeng, Jiarui Liu, Rui Yang, Shen … ·

    Knowledge Index of Noah's Ark

    arXiv:2606.05104v1 Announce Type: new Abstract: Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded…