Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 6h

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

A new benchmark called DSAEval has been introduced to evaluate data science agents on real-world problems. The benchmark includes multimodal perception, multi-query interactions, and multi-dimensional evaluation across reasoning, code, and results. In evaluations, Claude Sonnet 4.5 performed best overall, while MiMo-V2-Pro and GPT-5.2 excelled in duration and step efficiency, respectively. The study also found that multimodal perception significantly improves performance on vision tasks, though challenges persist in unstructured data domains.

IMPACT Establishes a new standard for evaluating AI data science agents, highlighting current limitations and future research directions.

GPT-5.2
MiMo-V2-Pro
Claude Sonnet 4.5
MiMo-V2-Flash
DSAEval
Maojun Sun