On the Limits of LLM-as-Judge for Scientific Novelty Assessment

Study finds large language models are unreliable at assessing scientific novelty

By the PulseAugur editorial team · [3 sources] · 2026-06-10 00:00

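A study recently posted on arXiv evaluates how reliably large language models (LLMs) assess the novelty of scientific research questions. The researchers developed a benchmark called RQ-Bench, which uses recent arXiv papers to compare LLM-generated questions against author-anchored reference questions. The results indicate that LLMs consistently overestimate the novelty of generated research questions, producing a "novelty illusion" that contradicts assessments by human experts. LLMs also tend to overlook key dimensions of the generated questions, such as narrowness or source-boundness, which raises concerns about their use in scientific evaluation.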

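Impact: Raises concerns about the current ability of LLMs to perform nuanced scientific evaluation, which may slow their adoption in research assessment.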

Ranking rationale: This cluster contains an academic paper that details a new benchmark and an evaluation of LLM capabilities.

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 3 sources. How we write summaries →

Coverage sources [3]

arXiv cs.AI TIER_1 English(EN) · Soumitra Sinhahajari, Navonil Majumder, Soujanya Poria · 2026-06-11 04:00

On the Limits of LLM-as-Judge for Scientific Novelty Assessment

arXiv:2606.12071v1 Announce Type: cross Abstract: LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical p…
arXiv cs.AI TIER_1 English(EN) · Soujanya Poria · 2026-06-10 13:34

On the Limits of LLM-as-Judge for Scientific Novelty Assessment

LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream obje…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

On the Limits of LLM-as-Judge for Scientific Novelty Assessment

Research questions generated by large language models exhibit inconsistent novelty assessments when compared to human experts, highlighting concerns about relying on LLMs for scientific novelty evaluation.
