Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio
Researchers have developed a new benchmark called MetaSyn to evaluate large language model (LLM) agents on the complex task of meta-analysis. The benchmark consists of 442 expert-curated meta-analyses from Nature Portfolio journals, including detailed criteria, a large corpus of PubMed articles, and verified positive and negative studies. Initial testing revealed that current LLM agents struggle significantly with the study selection phase: despite strong retrieval capabilities, they fail to reliably distinguish eligible literature from topically similar but ineligible distractors.
AI IMPACT: Highlights a critical bottleneck in LLM agent capabilities for scientific reasoning, particularly in complex information synthesis tasks.