Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 7h

Spokes: Optimizing for Diverse Pretraining Data Selection

Researchers have introduced Spokes, a probabilistic diversification framework that optimizes for diversity in pretraining data selection. The method uses the G-Vendi score and exponentiated gradient descent to build data subsets whose G-Vendi score is 489% higher than that of random samples. Applied to datasets such as FineWeb and DCLM, Spokes improves downstream performance by an average of 0.4 to 0.5 points over random sampling. Jointly optimizing for quality and diversity yields the strongest results, outperforming baselines by 1.4 to 1.5 points.
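To make the two ingredients named above more concrete, here is a minimal Python sketch, not the authors' implementation. It assumes each document is represented by a generic feature vector (the brief does not say how G-Vendi features are computed), uses the standard Vendi score (the exponential of the Shannon entropy of the eigenvalues of a weighted similarity matrix), and runs an exponentiated gradient update on sampling weights. The function names, the step size, and the toy data are all hypothetical.

```python
import numpy as np


def _unit_rows(features):
    """Scale every feature vector to unit L2 norm."""
    return features / np.linalg.norm(features, axis=1, keepdims=True)


def vendi_score(features, weights=None):
    """Standard Vendi score of weighted, unit-normalized feature vectors.

    Assumption: generic features stand in for whatever G-Vendi uses.
    """
    X = _unit_rows(np.asarray(features, dtype=float))
    n = X.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    # The d x d weighted covariance shares its nonzero eigenvalues
    # with the n x n weighted similarity matrix, and has trace 1.
    C = (X * w[:, None]).T @ X
    eig = np.linalg.eigvalsh(C)
    eig = eig[eig > 1e-12]
    return float(np.exp(-np.sum(eig * np.log(eig))))


def entropy_gradient(X, w, eps=1e-12):
    """Gradient of the entropy -tr(C log C) with respect to the weights w."""
    C = (X * w[:, None]).T @ X
    vals, vecs = np.linalg.eigh(C)
    log_C = (vecs * np.log(np.clip(vals, eps, None))) @ vecs.T
    # d/dw_i of -tr(C log C) is -x_i^T (log C + I) x_i, with ||x_i|| = 1.
    return -np.einsum("ij,jk,ik->i", X, log_C, X) - 1.0


def exponentiated_gradient(features, steps=100, eta=0.2):
    """Increase diversity by multiplicative updates that keep w on the simplex."""
    X = _unit_rows(np.asarray(features, dtype=float))
    w = np.full(X.shape[0], 1.0 / X.shape[0])
    for _ in range(steps):
        g = entropy_gradient(X, w)
        w = w * np.exp(eta * (g - g.max()))  # subtract max for numerical stability
        w = w / w.sum()
    return w


def demo(seed=0):
    """Toy data: 90 near-duplicate points plus 10 spread-out points in 4-D."""
    rng = np.random.default_rng(seed)
    cluster = np.array([1.0, 0.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(90, 4))
    spread = rng.normal(size=(10, 4))
    X = np.vstack([cluster, spread])
    w = exponentiated_gradient(X)
    return vendi_score(X), vendi_score(X, w), w
```

In this toy setup, `demo()` returns the Vendi score under uniform sampling, the score under the learned weights, and the weights themselves. The multiplicative update shifts probability mass away from the near-duplicate cluster, which is the general behavior a diversity-driven sampler would be expected to show. Sampling documents in proportion to the learned weights would then give a more diverse subset than uniform random sampling.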

IMPACT Enhances AI model performance by improving pretraining data diversity and quality.

Hugging Face
arXiv
FineWeb
DCLM
Spokes
G-Vendi score