Researchers have developed GRASP, a data attribution method for large-scale pretraining. Where earlier approaches score examples additively, GRASP models subset dynamics and interactions through a quadratic geometric penalty. The interaction-aware surrogate is built to run at pretraining scale, using low-dimensional feature sketches and a finite lower-confidence-bound selection protocol. In the reported evaluations, GRASP significantly outperforms existing methods on subset-retraining fidelity and lowers artifact construction cost, with demonstrated use in language model curation and vision dataset selection.
AI IMPACT GRASP offers a more efficient and effective way to curate massive pretraining datasets, which could improve downstream model performance and reduce computational cost.
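To make the contrast with additive attribution concrete, here is a minimal sketch of what an interaction-aware surrogate with a quadratic penalty could look like. It is not GRASP's published formulation: the scoring form (per-example scores minus a pairwise similarity penalty over low-dimensional sketches), the greedy lower-confidence-bound rule, and every name and parameter (`surrogate_score`, `greedy_lcb_select`, `lam`, `beta`) are assumptions chosen for illustration.

```python
# Illustrative sketch only. The scoring form, function names, and parameters
# are assumptions for exposition, not GRASP's actual method.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def surrogate_score(subset, additive, sketches, lam):
    """Additive per-example scores minus a quadratic pairwise penalty.

    The penalty sums sketch inner products over unordered pairs in the
    subset, so near-duplicate examples are worth less together.
    """
    linear = sum(additive[i] for i in subset)
    penalty = sum(
        dot(sketches[i], sketches[j])
        for i in subset for j in subset if i < j
    )
    return linear - lam * penalty


def greedy_lcb_select(additive, stderr, sketches, k, lam=1.0, beta=1.0):
    """Greedily pick up to k examples by penalized lower-confidence bound.

    Each candidate is scored as (mean - beta * stderr), minus lam times its
    similarity to the running sum of already-chosen sketches.
    """
    lcb = [m - beta * s for m, s in zip(additive, stderr)]
    chosen = []
    running = [0.0] * len(sketches[0])  # sum of chosen sketches
    remaining = set(range(len(additive)))
    while remaining and len(chosen) < k:
        best = max(
            sorted(remaining),  # sorted so ties break on the lowest index
            key=lambda i: lcb[i] - lam * dot(sketches[i], running),
        )
        chosen.append(best)
        running = [r + z for r, z in zip(running, sketches[best])]
        remaining.remove(best)
    return chosen


# Toy data: examples 0 and 1 are near-duplicates, example 2 is distinct.
additive = [1.0, 0.9, 0.5]
stderr = [0.0, 0.0, 0.0]
sketches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
picked = greedy_lcb_select(additive, stderr, sketches, k=2)
```

On this toy data a purely additive ranking would take the two near-duplicates (examples 0 and 1), while the penalized rule prefers the distinct pair, which is the kind of subset-level interaction an additive score cannot express.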
RANK_REASON The cluster contains a research paper detailing a new method for data attribution in machine learning pretraining.
AI-generated summary · Google Gemini · from 1 source.