A new study published on arXiv introduces benchmarks for evaluating agentic review systems designed to assist with the peer review process for AI-assisted research. The research evaluated two open-source systems, OpenAIReview and Coarse, alongside a proprietary system, Reviewer3, and a zero-shot baseline, using six different large language models. OpenAIReview combined with GPT-5.5 demonstrated strong performance, achieving 83.0% accuracy in tracking paper quality based on external signals and successfully detecting 71.6% of injected errors in a constructed benchmark.
AI IMPACT These agentic review systems could significantly improve the efficiency and accuracy of academic peer review, potentially speeding up research dissemination.
RANK_REASON The cluster contains an academic paper detailing new benchmarks and evaluations for AI systems.
- arXiv
- Conference on Neural Information Processing Systems
- GPT-5.5
- International Conference on Learning Representations
- OpenAI
- OpenAIReview
- Reviewer3
AI-generated summary · Google Gemini · from 1 source.