Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets
A new research paper quantifies the impact of different types of data leakage in machine learning models. The study found that selection leakage, such as peeking at data or cherry-picking random seeds, significantly inflates reported scores, potentially by 90%. The effect of memorization leakage grows with model capacity, while estimation and boundary leakage have negligible effects. The findings suggest that selection leakage is the most critical concern for tabular datasets.
AI IMPACT: Highlights the data leakage types that can skew ML benchmark results and urges researchers to focus on selection leakage.