New GRASP method improves data attribution for large-scale pretraining

By PulseAugur Editorial · [1 source] · 2026-06-08 04:00

Researchers have introduced GRASP (Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution), a data attribution method for large-scale pretraining. Previous scalable approaches are additive: they assign each training example an isolated utility score, which cannot capture subset-level effects such as redundancy between examples. GRASP instead models these interactions through a quadratic geometric penalty. To keep this interaction-aware surrogate efficient at pretraining scale, it relies on low-dimensional feature sketches and a finite lower-confidence bound selection protocol. According to the paper's evaluations, GRASP outperforms existing methods in subset-retraining fidelity and reduces artifact construction costs, with demonstrated utility in language model curation and vision dataset selection.
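To make the contrast between additive and interaction-aware scoring concrete, here is a minimal Python sketch. It is not GRASP's actual formulation, which this summary does not give. As an illustrative assumption, the penalty is taken to be the sum of pairwise dot products between low-dimensional sketch vectors, which is quadratic in the subset's indicator vector. The function names, the weight `lam`, and the toy data are all hypothetical.

```python
from itertools import combinations

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def additive_utility(subset, scores):
    """Additive baseline: sum of isolated per-example scores."""
    return sum(scores[i] for i in subset)

def interaction_aware_utility(subset, scores, sketches, lam=1.0):
    """Additive term minus a pairwise-similarity penalty.

    Assumption for illustration only: the penalty is the sum of dot
    products over all pairs in the subset, scaled by lam. Similar
    (redundant) examples are penalized; orthogonal ones are not.
    """
    items = sorted(subset)
    penalty = sum(dot(sketches[i], sketches[j])
                  for i, j in combinations(items, 2))
    return additive_utility(items, scores) - lam * penalty

# Toy data (hypothetical): examples 0 and 1 are exact duplicates in
# sketch space, example 2 is orthogonal to both but scores lower alone.
scores = [1.0, 1.0, 0.75]
sketches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

pairs = list(combinations(range(3), 2))
best_additive = max(pairs, key=lambda s: additive_utility(s, scores))
best_interaction = max(
    pairs, key=lambda s: interaction_aware_utility(s, scores, sketches, lam=0.5)
)
```

In this toy setup the additive baseline picks the duplicate pair (0, 1) because it only sees the two highest individual scores, while the penalized surrogate prefers (0, 2), trading a little individual score for diversity. Note that under this simple penalty, negatively correlated sketches would earn a bonus; a real method would need to decide whether that is desirable.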

IMPACT GRASP offers a more efficient and effective way to curate massive pretraining datasets, potentially improving downstream model performance and reducing computational costs.

RANK_REASON The cluster contains a research paper detailing a new method for data attribution in machine learning pretraining.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.LG TIER_1 English(EN) · Yue Min, Ruining Chen, Yujun Li · 2026-06-08 04:00

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

arXiv:2606.06892v1. Abstract: Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and co…
