Brief · PulseAugur

RESEARCH · arXiv cs.IR (Information Retrieval) English(EN) · 1w · [2 sources]

GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

GIScholarBench is a new benchmark for evaluating overconfidence in large language models applied to Geographic Information Science (GIS) research. Built from 10,865 papers, it tests models on three tasks: metadata retrieval, literature linking, and research direction generation. Evaluations of Claude Sonnet 4.5, Gemini 3, and ChatGPT 5.3 found consistent overconfidence across all three tasks, manifesting as factual overgeneration, unreliable citation expansion, and overconfidence in the completeness of their outputs.

IMPACT Highlights a critical limitation of LLMs in academic research: models need better calibration before they can be used reliably for scholarly tasks.

Gemini 3
LLMs
Claude Sonnet 4.5
ChatGPT 5.3
GIScholarBench
Large language models