Researchers have introduced D3-Gym, a dataset for building verifiable environments for scientific data-driven discovery tasks. It contains 565 tasks drawn from real scientific repositories, each with instructions, an executable environment, and evaluation scripts that align closely with human judgment. Training AI models on D3-Gym improved their performance, notably raising the Qwen3-32B model's score on the ScienceAgentBench benchmark by 7.8 points.
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Provides a new benchmark and training data to improve AI agents for scientific discovery.
RANK_REASON The cluster describes a new academic paper introducing a dataset and its evaluation.