GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

Large language models show persistent overconfidence in geographic information science research tasks

By the PulseAugur editorial team · [2 sources] · 2026-06-06 07:56

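GIScholarBench, a new benchmark, evaluates the overconfidence of large language models (LLMs) in geographic information science (GIS) research. Built from 10,865 papers, it tests models on three tasks: metadata retrieval, literature linking, and research-direction generation. An evaluation of Claude Sonnet 4.5, Gemini 3, and ChatGPT 5.3 found persistent overconfidence across all tasks, in three forms: factual overgeneration, unreliable citation expansion, and overconfidence about the completeness of their outputs.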

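Impact: The results highlight a key limitation of LLMs in academic research; better calibration is needed before they can be used reliably for scholarly tasks.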

Ranking rationale: This cluster contains an academic paper that introduces a new benchmark for LLM performance.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI · Zongrng Li, Mingzheng Yang, Lei Zou, Hongxu Ma, Hao Tian, Siqi Zhou, Wenjing Gong, Kaili Zhang, Bingqian Chen, Mitch Zhang, Yifan Yang · 2026-06-09 04:00

GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

arXiv:2606.08036v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used in academic research workflows, but scholarly tasks require high factual precision and therefore expose a key weakness: overconfidence. Here, overconfidence is defined behaviorall…
arXiv cs.IR (Information Retrieval) · Yifan Yang · 2026-06-06 07:56

GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

Large language models (LLMs) are increasingly used in academic research workflows, but scholarly tasks require high factual precision and therefore expose a key weakness: overconfidence. Here, overconfidence is defined behaviorally as the tendency to produce confident, assertive,…
