PulseAugur

New MetaSyn benchmark reveals LLM agents struggle with study selection

Researchers have introduced MetaSyn, a benchmark for evaluating Large Language Model (LLM) agents on the complex task of meta-analysis. The benchmark consists of 442 expert-curated meta-analyses from Nature Portfolio journals, including detailed criteria, a large corpus of PubMed articles, and verified positive and negative studies. Initial testing shows that current LLM agents struggle with the study selection phase: despite strong retrieval capabilities, they fail to reliably distinguish eligible studies from topically similar but ineligible distractors.

IMPACT Highlights a critical bottleneck in LLM agent capabilities for scientific reasoning, particularly in complex information synthesis tasks.

RANK_REASON The cluster contains an academic paper introducing a new benchmark dataset and evaluation methodology for LLM agents.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

    Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

    arXiv:2606.17041v1. Abstract: Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating s…

  2. arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Qingyao Ai

    Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

    Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing ben…