Knowledge Index of Noah's Ark

New KINA benchmark ranks Gemini 3.1 Pro highest, ahead of Claude and GPT-5

By the PulseAugur editorial team · [2 sources] · 2026-06-03 17:06

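A new benchmark called KINA has been introduced to evaluate large language models across 261 fine-grained disciplines, addressing the problems of scaling-driven design and annotation quality. The benchmark contains 899 items and was used to evaluate 42 models from 13 different labs. Gemini-3.1-Pro-Preview was the top-performing model with a score of 53.17%, followed by Claude-Opus-4.6 and GPT-5.4, which suggests that all models still have substantial room for improvement.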

Impact: Establishes a new evaluation standard for LLMs, highlighting performance tiers and the effect of tool augmentation.

Ranking rationale: This cluster contains a research paper that introduces a new LLM benchmark and reports its evaluation results.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Sheng Jin, Minghao Liu, Yunze Xiao, Zeqi Zhou, Heli Qi, Yifan Yao, Meishu Song, Kaijing Ma, Xuan Zhang, Sicong Jiang, Yizhe Li, Ningshan Ma, Jie Wei, Ziniu Li, Minglai Yang, Bangya Liu, Yiming Liang, Xiao Fang, Qingcheng Zeng, Jiarui Liu, Rui Yang, Shen … · 2026-06-04 04:00

Knowledge Index of Noah's Ark

arXiv:2606.05104v1 Announce Type: new Abstract: Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded…

arXiv cs.AI TIER_1 English(EN) · Ge Zhang · 2026-06-03 17:06

Knowledge Index of Noah's Ark

Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item be…
