A new benchmark called DSAEval has been introduced to evaluate data science agents on real-world problems. The benchmark includes multimodal perception, multi-query interactions, and multi-dimensional evaluation across reasoning, code, and results. In evaluations, Claude Sonnet 4.5 performed best overall, while MiMo-V2-Pro and GPT-5.2 excelled in duration and step efficiency, respectively. The study also found that multimodal perception significantly improves performance on vision tasks, though challenges persist in unstructured data domains.
AI IMPACT Establishes a new standard for evaluating AI data science agents, highlighting current limitations and future research directions.
RANK_REASON The cluster describes a new academic paper introducing a benchmark for evaluating AI agents.
AI-generated summary · Google Gemini · from 1 source.