PulseAugur

Web graph structure guides language model pretraining data selection

Researchers have developed a method called WebGraphMix for selecting language model pretraining data. The approach uses the structure of the web graph to separate central from peripheral documents, on the hypothesis that central hosts offer reusable abstractions while peripheral ones supply specialized knowledge. In experiments, a 1:1 mixture of central and peripheral data improves average performance across 23 tasks compared with uniform sampling, and the gains increase further when the mixture is combined with document-level quality classifiers.
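To make the idea concrete, here is a minimal sketch of centrality-based selection followed by a 1:1 mix. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say which centrality measure WebGraphMix uses, so in-degree over a host-level link graph stands in for it, and the function names, the 25% cutoff, and the toy hosts are all hypothetical.

```python
import random
from collections import defaultdict


def indegree_centrality(edges):
    """Score each host by the number of distinct other hosts linking to it.

    Assumption: in-degree is only a stand-in for whatever centrality
    measure the paper actually uses.
    """
    inbound = defaultdict(set)
    hosts = set()
    for src, dst in edges:
        hosts.update((src, dst))
        if src != dst:  # ignore self-links
            inbound[dst].add(src)
    return {host: len(inbound[host]) for host in hosts}


def split_hosts(scores, fraction=0.25):
    """Return (central, peripheral): the top and bottom `fraction` of hosts.

    The 25% default is a hypothetical cutoff chosen for illustration.
    """
    ranked = sorted(scores, key=lambda h: (-scores[h], h))
    k = max(1, int(len(ranked) * fraction))
    return set(ranked[:k]), set(ranked[-k:])


def mix_one_to_one(docs_by_host, central, peripheral, n_total, seed=0):
    """Sample up to n_total documents, half from each group of hosts."""
    rng = random.Random(seed)
    central_docs = [d for h in sorted(central) for d in docs_by_host.get(h, [])]
    peripheral_docs = [d for h in sorted(peripheral) for d in docs_by_host.get(h, [])]
    # Keep the ratio at exactly 1:1 even if one pool is too small.
    half = min(n_total // 2, len(central_docs), len(peripheral_docs))
    return rng.sample(central_docs, half) + rng.sample(peripheral_docs, half)


# Toy host-level link graph: (linking host, linked host). All names are made up.
edges = [
    ("a.com", "hub.org"), ("b.com", "hub.org"), ("c.com", "hub.org"),
    ("d.net", "hub.org"), ("hub.org", "a.com"), ("a.com", "b.com"),
]
docs_by_host = {
    "hub.org": ["hub-1", "hub-2", "hub-3"],
    "d.net": ["d-1", "d-2"],
    "a.com": ["a-1"],
}

scores = indegree_centrality(edges)
central, peripheral = split_hosts(scores)
mixture = mix_one_to_one(docs_by_host, central, peripheral, n_total=4)
```

On this toy graph, `hub.org` is the only central host and `d.net` the only peripheral one, so `mixture` holds two documents from each. A real pipeline would more likely compute a measure such as PageRank over a crawl-scale host graph and sample by token count rather than document count; both choices are left open here because the summary does not specify them.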

IMPACT This method offers a computationally efficient way to curate pretraining data, potentially improving model performance by leveraging web graph topology.

RANK_REASON The cluster contains an academic paper detailing a new method for pretraining language models.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Vedant Badoni, Danqi Chen, Xinyi Wang

    Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

    arXiv:2606.11499v1 · The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational ove…

  2. arXiv cs.CL TIER_1 English(EN) · Xinyi Wang

    Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

    The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose W…