LLMs struggle to reproduce physics experiment results, failing numerical simulations

By PulseAugur Editorial · [1 source] · 2026-04-27 07:34

A new preprint from Peking University evaluated how well large language models can reproduce numerical results from experimental physics papers. The researchers found that all tested LLMs, including OpenAI Codex powered by GPT-5.3, achieved a 0% end-to-end callback rate, meaning none of them replicated a paper's full numerical results. The models demonstrated strong comprehension of the papers' methodologies but consistently made errors in data analysis and numerical simulation, which led to incorrect final results. The study identified several failure modes, including formula implementation errors and oversimplification of complex physical models.

IMPACT LLMs struggle with complex numerical simulation and data analysis in scientific research, indicating limitations beyond text comprehension.

RANK_REASON Academic paper evaluating LLM capabilities on a new domain (physics simulation).


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

LessWrong (AI tag) TIER_1 English(EN) · fessus · 2026-04-27 07:34

AI Is Bad at Physics

There’s a <a href="https://arxiv.org/pdf/2603.27646">new preprint</a> from Peking University in China that assesses LLM capabilities in reproducing results from experimental physics papers. Their finding? All the agents had a 0% …
