PulseAugur

New BLISS method speeds up LLM pretraining with efficient data selection

Researchers have developed BLISS, a method for selecting pretraining data for large language models more efficiently. Unlike previous methods, BLISS does not require external pretrained models, and it accounts for the long-term impact of data on training. It uses a proxy model and a score model within a bilevel optimization formulation to predict influence scores for training samples, which are then used to select high-quality data. In experiments with Pythia and LLaMA models, BLISS reached target performance 1.7x faster than state-of-the-art methods.
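To make the selection step concrete, here is a minimal sketch of keeping the highest-scoring fraction of a data pool. It is not the BLISS implementation: `toy_score` is a hypothetical placeholder for the predicted influence scores, `select_top_fraction` and `keep_fraction` are illustrative names, and the bilevel training of the proxy and score models is not shown.

```python
from typing import Callable, List, Sequence


def toy_score(text: str) -> float:
    """Hypothetical placeholder score: ratio of unique words to total words.

    A real pipeline would use a learned influence score here instead.
    """
    words = text.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def select_top_fraction(
    samples: Sequence[str],
    score_fn: Callable[[str], float],
    keep_fraction: float,
) -> List[str]:
    """Return the highest-scoring `keep_fraction` of `samples`."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not samples:
        return []
    # Always keep at least one sample so the result is never empty.
    k = max(1, int(len(samples) * keep_fraction))
    ranked = sorted(samples, key=score_fn, reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    pool = ["a a a a", "the cat sat", "b b", "one two two"]
    print(select_top_fraction(pool, toy_score, 0.5))
```

The ranking function is passed in as an argument, so any per-sample scorer can be swapped in without changing the selection logic.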

IMPACT Enables faster and more efficient pretraining of large language models by optimizing data selection.

RANK_REASON The cluster contains an academic paper detailing a new method for language model pretraining.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.LG TIER_1 English(EN) · Jie Hao, Rui Yu, Wei Zhang, Huixia Wang, Jie Xu, Mingrui Liu

    BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

    arXiv:2510.06048v4. Abstract: Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained…