A new benchmark, ProjectionBench, evaluates how well large language models generate scientific hypotheses. The framework reveals information from a research paper progressively, and the model generates a hypothesis at each stage. The benchmark was used to assess GPT-5.4, GPT-5, Gemini 2.5 Pro, and Gemini 3.1 Pro Preview across 45 papers. Results indicate that GPT-5.4 and Gemini 3.1 Pro outperform their predecessors, with GPT-5.4 staying closely aligned with the papers' ground-truth conclusions even when given limited information.
AI IMPACT This benchmark could drive development of LLMs capable of genuine scientific discovery and reasoning.
AI-generated summary · Google Gemini · from 2 sources.