PulseAugur
On the Limits of LLM-as-Judge for Scientific Novelty Assessment

Study finds large language models are unreliable at assessing scientific novelty

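A study recently posted on arXiv examines how reliably large language models (LLMs) judge the novelty of scientific research questions. The researchers built a benchmark called RQ-Bench, which uses recent arXiv papers to compare LLM-generated questions against author-anchored reference questions. The results indicate that LLMs consistently overestimate the novelty of generated research questions, producing a "novelty illusion" that contradicts human expert assessments. LLMs also tend to overlook key dimensions of the generated questions, such as narrowness or source-boundness, which raises concerns about their use in scientific evaluation.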

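Impact: Raises concerns about the current ability of LLMs to perform nuanced scientific evaluation, which may slow their adoption in research assessment.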

Ranking rationale: This cluster contains an academic paper detailing a new benchmark and an evaluation of LLM capabilities.

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 3 sources. How we write summaries →

Sources [3]

  1. arXiv cs.AI TIER_1 English(EN) · Soumitra Sinhahajari, Navonil Majumder, Soujanya Poria ·

    On the Limits of LLM-as-Judge for Scientific Novelty Assessment

    arXiv:2606.12071v1 Announce Type: cross Abstract: LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical p…

  2. arXiv cs.AI TIER_1 English(EN) · Soujanya Poria ·

    On the Limits of LLM-as-Judge for Scientific Novelty Assessment

    LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream obje…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    On the Limits of LLM-as-Judge for Scientific Novelty Assessment

    Research questions generated by large language models exhibit inconsistent novelty assessments when compared to human experts, highlighting concerns about relying on LLMs for scientific novelty evaluation.