InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation
Researchers have introduced InteractScience, a new benchmark designed to evaluate the ability of large language models to generate interactive scientific demonstrations. This benchmark combines programmatic functional testing with visually-grounded qualitative testing to assess both the scientific accuracy and the interactive coding capabilities of models. Evaluations of 30 leading LLMs revealed persistent weaknesses in their integration of domain knowledge with interactive front-end development, highlighting the need for further advancements in this area.
IMPACT: Establishes a new evaluation standard for LLMs in scientific code generation, driving progress in creating interactive educational and research tools.