DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems
DSAEval is a new benchmark for evaluating data science agents on real-world problems. It features multimodal perception, multi-query interactions, and multi-dimensional evaluation across reasoning, code, and results. In the evaluations, Claude Sonnet 4.5 performed best overall, while MiMo-V2-Pro led in duration efficiency and GPT-5.2 led in step efficiency. The study also found that multimodal perception significantly improves performance on vision tasks, though challenges persist in unstructured data domains.
AI IMPACT: Establishes a new standard for evaluating AI data science agents, highlighting current limitations and future research directions.