PulseAugur

New GRASP method improves data attribution for large-scale pretraining

Researchers have introduced GRASP, a data attribution method for large-scale pretraining. Where earlier additive approaches assign each training example an isolated utility score, GRASP models subset dynamics and interactions, such as data redundancy, through a quadratic geometric penalty. The interaction-aware surrogate is designed to stay efficient at pretraining scale by using low-dimensional feature sketches and a finite lower-confidence-bound selection protocol. According to the paper's evaluations, GRASP significantly outperforms existing methods in subset-retraining fidelity and reduces artifact construction costs, with demonstrated utility in language model curation and vision dataset selection.
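To make the contrast between additive and interaction-aware scoring concrete, here is a minimal sketch. It is not GRASP's actual formulation, which this summary does not give. It assumes a generic penalty equal to the sum of pairwise inner products between example feature vectors, and every name in it (`additive_score`, `interaction_score`, `best_pair`, `lam`) is hypothetical. It leaves out the feature sketching and lower-confidence-bound selection steps.

```python
from itertools import combinations

def additive_score(scores, subset):
    """Additive assumption: subset utility is the sum of per-example scores."""
    return sum(scores[i] for i in subset)

def interaction_score(scores, vecs, subset, lam=0.5):
    """Additive score minus a pairwise penalty (illustrative assumption).

    The penalty is lam times the sum of inner products over all pairs in
    the subset, so near-duplicate examples reduce each other's value.
    """
    total = additive_score(scores, subset)
    for i, j in combinations(subset, 2):
        total -= lam * sum(x * y for x, y in zip(vecs[i], vecs[j]))
    return total

def best_pair(n, score_fn):
    """Exhaustively pick the best 2-element subset (fine for toy sizes only)."""
    return max(combinations(range(n), 2), key=score_fn)

# Toy data: examples 0 and 1 are duplicates; example 2 is distinct
# but has a slightly lower individual score.
scores = [1.0, 1.0, 0.9]
vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

additive_choice = best_pair(3, lambda s: additive_score(scores, s))
interaction_choice = best_pair(3, lambda s: interaction_score(scores, vecs, s))
```

On this toy data the additive score selects the duplicate pair, while the penalized score prefers pairing one duplicate with the distinct example. This is the kind of redundancy effect that isolated per-example scores cannot express.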

IMPACT GRASP offers a more efficient and effective way to curate massive pretraining datasets, potentially improving downstream model performance and reducing computational costs.

RANK_REASON The cluster contains a research paper detailing a new method for data attribution in machine learning pretraining.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.LG TIER_1 English(EN) · Yue Min, Ruining Chen, Yujun Li

    GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

    arXiv:2606.06892v1 Abstract: Scalable data attribution methods typically assign isolated utility scores to individual training examples. This prevalent additive assumption fundamentally fails to capture critical subset dynamics, including data redundancy and co…