PulseAugur

New DSAEval benchmark tests AI data science agents

A new benchmark called DSAEval has been introduced to evaluate data science agents on real-world problems. The benchmark includes multimodal perception, multi-query interactions, and multi-dimensional evaluation across reasoning, code, and results. In evaluations, Claude Sonnet 4.5 performed best overall, while MiMo-V2-Pro and GPT-5.2 excelled in duration and step efficiency, respectively. The study also found that multimodal perception significantly improves performance on vision tasks, though challenges persist in unstructured data domains.

IMPACT Establishes a new standard for evaluating AI data science agents, highlighting current limitations and future research directions.

RANK_REASON The cluster describes a new academic paper introducing a benchmark for evaluating AI agents.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang

    DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

    arXiv:2601.13591v2 Announce Type: replace Abstract: Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack stand…