New benchmark evaluates LLMs on interactive scientific code generation

By PulseAugur Editorial · [1 source] · 2026-05-22 04:00

Researchers have introduced InteractScience, a benchmark that evaluates how well large language models generate interactive scientific demonstrations. It combines programmatic functional testing with visually-grounded qualitative testing to assess both the scientific accuracy and the interactive coding ability of models. An evaluation of 30 leading LLMs revealed persistent weaknesses in integrating domain knowledge with interactive front-end development.

IMPACT Offers a new evaluation benchmark for LLMs in scientific code generation, which could support progress on interactive educational and research tools.

RANK_REASON The cluster describes a new academic paper introducing a benchmark for evaluating LLM capabilities.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Qiaosheng Chen, Yang Liu, Lei Li, Kai Chen, Qipeng Guo, Gong Cheng, Fei Yuan · 2026-05-22 04:00

InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation

arXiv:2510.09724v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly capable of generating complete applications from natural language instructions, creating new opportunities in science and education. In these domains, interactive scientific de…
