New research proposes budget-optimal data collection for machine learning

By PulseAugur Editorial · [1 source] · 2026-06-17 04:00

A new paper on arXiv describes a method for budget-optimal data collection from multiple heterogeneous sources. The approach maximizes effective sample size by accounting for each source's cost and group composition, using a sampling plan that minimizes a $\chi^2$-divergence. Paired with a post-stratification estimator, the plan achieves budgeted minimax-optimal risk for estimating population and group-conditional means, and it can be extended to prediction problems.
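To make the link between effective sample size and $\chi^2$-divergence concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: it assumes the Kish-style definition n / (1 + χ²(p ‖ q)), where p is the target group distribution and q is the group composition of the pooled sample, and it searches integer plans for two sources by brute force. All function names, costs, and compositions are hypothetical.

```python
from collections import defaultdict


def chi2_divergence(p, q):
    """chi^2(p || q) = sum_g p_g^2 / q_g - 1 for discrete distributions."""
    # Groups with zero target mass contribute nothing and are skipped.
    return sum(pg * pg / qg for pg, qg in zip(p, q) if pg > 0) - 1.0


def pooled_composition(n_per_source, compositions):
    """Expected group shares of the pooled sample when source k gives n_k draws."""
    n = sum(n_per_source)
    groups = range(len(compositions[0]))
    return [sum(nk * comp[g] for nk, comp in zip(n_per_source, compositions)) / n
            for g in groups]


def effective_sample_size(n_per_source, compositions, p):
    """Assumed definition: n / (1 + chi^2(p || q)), with q the pooled composition."""
    q = pooled_composition(n_per_source, compositions)
    if any(qg == 0 and pg > 0 for pg, qg in zip(p, q)):
        return 0.0  # a target group would never be observed
    return sum(n_per_source) / (1.0 + chi2_divergence(p, q))


def best_two_source_plan(costs, compositions, p, budget):
    """Brute-force the integer plan (n0, n1) with the largest ESS within budget."""
    best_plan, best_ess = None, -1.0
    for n0 in range(int(budget // costs[0]) + 1):
        # ESS never decreases with extra samples, so spend the remainder on source 1.
        n1 = int((budget - n0 * costs[0]) // costs[1])
        if n0 + n1 == 0:
            continue
        ess = effective_sample_size([n0, n1], compositions, p)
        if ess > best_ess:
            best_plan, best_ess = (n0, n1), ess
    return best_plan, best_ess


def post_stratified_mean(samples, p):
    """Weight each group's sample mean by its known population share p_g."""
    sums, counts = defaultdict(float), defaultdict(int)
    for g, y in samples:
        sums[g] += y
        counts[g] += 1
    if any(counts[g] == 0 and p[g] > 0 for g in range(len(p))):
        raise ValueError("every group with positive share needs at least one sample")
    return sum(p[g] * sums[g] / counts[g] for g in range(len(p)) if p[g] > 0)


# Hypothetical setup: a cheap source skewed toward group 0, a pricier one toward group 1.
plan, ess = best_two_source_plan(
    costs=[1, 2],
    compositions=[[0.9, 0.1], [0.2, 0.8]],
    p=[0.5, 0.5],
    budget=100,
)
```

Under this definition, a plan whose pooled composition matches the target exactly has zero divergence and an effective sample size equal to the raw sample count; any mismatch shrinks it, which is why cost and composition have to be traded off together.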




AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv stat.ML TIER_1 English(EN) · Michael O. Harding, Vikas Singh, Kirthevasan Kandasamy · 2026-06-17 04:00

Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

arXiv:2602.17894v2 Announce Type: replace Abstract: Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use case…

