New Benchmark Tests LLMs on Scientific Hypothesis Generation

By PulseAugur Editorial · [2 sources] · 2026-05-28 17:38

A new benchmark, ProjectionBench, evaluates how well large language models generate scientific hypotheses. The framework reveals information from a research paper progressively and has the model generate a hypothesis at each stage. The benchmark was used to assess GPT-5.4, GPT-5, Gemini 2.5 Pro, and Gemini 3.1 Pro preview across 45 papers. Results indicate that GPT-5.4 and Gemini 3.1 Pro outperform their predecessors, and that GPT-5.4 stays closely aligned with ground-truth conclusions even when given limited information.

IMPACT This benchmark could drive development of LLMs capable of genuine scientific discovery and reasoning.

RANK_REASON The cluster describes a new academic paper introducing a benchmark for evaluating LLMs.
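
To make the progressive-disclosure idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names (`progressive_disclosure`, `word_overlap`, `echo_model`), the word-overlap score, and the placeholder text are all assumptions made for illustration. It only shows the general shape of the loop: reveal one more piece of a paper, ask for a hypothesis, and score it against the known conclusion.

```python
from typing import Callable, List, Sequence


def progressive_disclosure(
    sections: Sequence[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    ground_truth: str,
) -> List[float]:
    """Reveal sections cumulatively and score the hypothesis at each stage."""
    scores = []
    for stage in range(1, len(sections) + 1):
        # The model sees everything revealed so far, not just the newest piece.
        context = "\n".join(sections[:stage])
        hypothesis = generate(context)
        scores.append(score(hypothesis, ground_truth))
    return scores


def word_overlap(hypothesis: str, reference: str) -> float:
    """Toy alignment score (assumption): share of reference words in the hypothesis."""
    ref = set(reference.lower().split())
    hyp = set(hypothesis.lower().split())
    return len(ref & hyp) / len(ref) if ref else 0.0


def echo_model(context: str) -> str:
    """Stand-in for an LLM call (assumption): returns the last revealed line."""
    return context.splitlines()[-1]


# Placeholder text, not taken from any paper in the benchmark.
sections = [
    "silk fibers are strong",
    "strength depends on beta sheet crystals",
    "smaller beta sheet crystals increase toughness",
]
truth = "smaller beta sheet crystals increase toughness"

stage_scores = progressive_disclosure(sections, echo_model, word_overlap, truth)
print(stage_scores)
```

In a real evaluation, `generate` would wrap a call to the model under test and `score` would be whatever alignment measure the benchmark defines; neither is specified in this summary.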

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.AI TIER_1 English(EN) · A. J. Lew (Unreasonable Labs), Y. Cao (Unreasonable Labs), M. J. Buehler (Unreasonable Labs) · 2026-05-29 04:00

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

arXiv:2605.30284v1 · Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge. While many benchmarks have been proposed to evaluate large language model (LLM) performance on deep rese…

arXiv cs.AI TIER_1 English(EN) · M. J. Buehler · 2026-05-28 17:38

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge. While many benchmarks have been proposed to evaluate large language model (LLM) performance on deep research tasks via multi-hop retrieval, their innova…
