PulseAugur
research · [2 sources] ·

Can Coding Agents Reproduce Findings in Computational Materials Science?

Researchers have introduced AutoMat, a benchmark that tests whether AI coding agents can reproduce findings from computational materials science papers. It evaluates agents on three abilities: reconstructing complex scientific workflows, navigating specialized toolchains, and interpreting results to support or refute scientific claims. Current LLM-based agents performed poorly: the best-performing setting reached a success rate of only 54.1%, which points to weaknesses in handling incomplete procedures and methodological deviations.

Summary written from 2 sources.

IMPACT Highlights current limitations of AI agents in scientific reproducibility, suggesting a need for improved domain-specific reasoning and workflow reconstruction.

RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating AI agents.


COVERAGE [2]

  1. arXiv cs.CL TIER_1 · Ziyang Huang, Yi Cao, Ali K. Shargh, Jing Luo, Ruidong Mei, Mohd Zaki, Zhan Liu, Wyatt Bunstine, William Jurayj, Somdatta Goswami, Tyrel McQueen, Michael Shields, Jaafar El-Awady, Paulette Clancy, Benjamin Van Durme, Nicholas Andrews, William Walden, Dani ·

    Can Coding Agents Reproduce Findings in Computational Materials Science?

    arXiv:2605.00803v1 (announce type: cross). Abstract: Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational…

  2. arXiv cs.CL TIER_1 · Daniel Khashabi ·

    Can Coding Agents Reproduce Findings in Computational Materials Science?

    Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not onl…