Researchers have developed a new benchmark called MetaSyn to evaluate Large Language Model (LLM) agents on the complex task of meta-analysis. The benchmark consists of 442 expert-curated meta-analyses from Nature Portfolio journals, including detailed criteria, a large corpus of PubMed articles, and verified positive and negative studies. Initial testing revealed that current LLM agents struggle significantly with the study selection phase: despite strong retrieval capabilities, they fail to reliably distinguish eligible studies from topically similar but ineligible distractors.
AI IMPACT: Highlights a critical bottleneck in LLM agent capabilities for scientific reasoning, particularly in complex information synthesis tasks.
Source: arXiv cs.IR (Information Retrieval)
- Hugging Face
- LLM Agents
- Nature Portfolio
- PubMed
AI-generated summary · Google Gemini · from 2 sources.