Spokes framework boosts AI pretraining data diversity by 489%

By PulseAugur Editorial · [1 source] · 2026-06-16 04:00

Researchers have developed Spokes, a probabilistic diversification framework that optimizes for diversity in pretraining data selection. The method uses the G-Vendi score and exponentiated gradient descent to build data subsets that are substantially more diverse than random samples, with a 489% increase in G-Vendi score. Applied to datasets such as FineWeb and DCLM, Spokes improves downstream performance by an average of 0.4 to 0.5 points over random sampling. Jointly optimizing for quality and diversity yields the strongest results, outperforming baselines by 1.4 to 1.5 points.
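The summary does not define the G-Vendi score or the exact Spokes update rule, so the following is only a minimal sketch of the general idea, under stated assumptions. It substitutes the plain Vendi score (the exponential of the entropy of the eigenvalues of a weighted feature covariance matrix) for G-Vendi, and runs exponentiated gradient ascent on sampling weights over a toy feature matrix. The names `vendi_score` and `eg_diversify`, the step size, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def vendi_score(X, p, eps=1e-12):
    """Vendi-style diversity of unit-norm rows X under sampling weights p.

    Assumption: stand-in for G-Vendi, whose definition the summary omits.
    """
    C = X.T @ (p[:, None] * X)            # weighted covariance, trace = 1
    w = np.clip(np.linalg.eigvalsh(C), eps, None)
    return float(np.exp(-np.sum(w * np.log(w))))

def eg_diversify(X, steps=100, lr=0.5, eps=1e-12):
    """Exponentiated gradient ascent on the entropy behind vendi_score.

    Hypothetical helper: multiplicative updates keep p on the simplex.
    """
    n = X.shape[0]
    p = np.full(n, 1.0 / n)               # start from uniform (random) sampling
    for _ in range(steps):
        C = X.T @ (p[:, None] * X)
        w, V = np.linalg.eigh(C)
        log_C = (V * np.log(np.clip(w, eps, None))) @ V.T
        # Entropy gradient w.r.t. p_i is -x_i^T log(C) x_i, up to a constant
        # that cancels when p is renormalized.
        g = -np.einsum("ij,jk,ik->i", X, log_C, X)
        p = p * np.exp(lr * g)
        p /= p.sum()
    return p

# Toy data: eight near-duplicate documents plus two distinct ones.
X = np.vstack([np.tile([1.0, 0.0, 0.0], (8, 1)),
               [[0.0, 1.0, 0.0]],
               [[0.0, 0.0, 1.0]]])
uniform = np.full(len(X), 1.0 / len(X))
before = vendi_score(X, uniform)
p = eg_diversify(X)
after = vendi_score(X, p)
print(f"uniform: {before:.3f}  reweighted: {after:.3f}")
```

On this toy set the update moves probability mass away from the eight duplicates, so the score rises from about 1.9 under uniform weights toward 3, the number of distinct directions in the data. Sampling a subset according to `p` would then favor the rarer examples.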

IMPACT Enhances AI model performance by improving pretraining data diversity and quality.

RANK_REASON The cluster describes a new research paper published on arXiv detailing a novel method for optimizing pretraining data selection.



AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Clarence Lee, Yejin Choi, Luke Zettlemoyer, Pang Wei Koh, Hai Leong Chieu · 2026-06-16 04:00

Spokes: Optimizing for Diverse Pretraining Data Selection

arXiv:2606.15216v1 Announce Type: cross Abstract: Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property th…

