Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality
Researchers have developed WebGraphMix, a new method for selecting pretraining data for language models. The approach uses the structure of the web graph to identify central and peripheral documents, on the hypothesis that central hosts offer reusable abstractions while peripheral ones provide specialized knowledge. In experiments, a 1:1 mixture of central and peripheral data improves average performance across 23 tasks and outperforms uniform sampling; combining the mixture with document-level quality classifiers improves results further.
AI IMPACT: This method offers a computationally efficient way to curate pretraining data, potentially improving model performance by leveraging web graph topology.
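
The central/peripheral mixture described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not say which centrality measure or cutoffs are used, so the in-degree score, the top and bottom 25% thresholds, and every function name here (`indegree_centrality`, `split_hosts`, `mix_one_to_one`) are assumptions chosen only for illustration.

```python
import random
from collections import Counter


def indegree_centrality(edges):
    """Count inbound links per host (assumed stand-in for a centrality score)."""
    scores = Counter()
    for src, dst in edges:
        scores[src] += 0  # register hosts that have no inbound links
        if src != dst:    # ignore self-links
            scores[dst] += 1
    return scores


def split_hosts(scores, top_frac=0.25, bottom_frac=0.25):
    """Return (central, peripheral) host sets; the cutoffs are assumptions."""
    # Sort by descending score, breaking ties by host name for determinism.
    ranked = sorted(scores, key=lambda h: (-scores[h], h))
    k_top = max(1, int(len(ranked) * top_frac))
    k_bottom = max(1, int(len(ranked) * bottom_frac))
    return set(ranked[:k_top]), set(ranked[-k_bottom:])


def mix_one_to_one(docs, central, peripheral, n, seed=0):
    """Sample up to n documents, half from central and half from peripheral hosts."""
    rng = random.Random(seed)
    central_docs = [d for d in docs if d["host"] in central]
    peripheral_docs = [d for d in docs if d["host"] in peripheral]
    # Keep the two halves equal in size, limited by the smaller pool.
    half = min(n // 2, len(central_docs), len(peripheral_docs))
    return rng.sample(central_docs, half) + rng.sample(peripheral_docs, half)
```

Given a list of `(source_host, target_host)` link pairs and documents shaped like `{"host": ..., "text": ...}`, calling `split_hosts(indegree_centrality(edges))` and passing the result to `mix_one_to_one` yields a sample with equal numbers of central and peripheral documents. A document-level quality filter, which the summary says combines well with the mixture, could be applied to `docs` before sampling.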