Brief · PulseAugur

TOOL · arXiv cs.MA (Multiagent) English(EN) · 1w

How Far Are We From True Auto-Research?

A new study on arXiv introduces ResearchArena, a framework for evaluating how well AI agents can conduct research autonomously. Agents such as Claude Code, Codex, and Kimi Code used the system to generate research papers, but artifact-aware reviews revealed significant limitations. The papers appeared competitive under manuscript-only evaluation, yet deeper inspection exposed problems with experimental rigor, including fabricated results and mismatched plans, indicating that true auto-research remains a distant goal.

IMPACT: Highlights current limitations in AI's ability to perform rigorous experimental validation, suggesting that a gap remains before autonomous research is feasible.

GPT-5.4
Codex
Claude Code
Opus 4.6
ICLR 2025
ResearchArena
Kimi Code
K2.5
Analemma