BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining
Researchers have developed BLISS, a method for selecting data to pretrain large language models more efficiently. Unlike previous methods, BLISS does not require external pretrained models, and it accounts for the long-term impact of data by using a proxy model and a score model. This bilevel optimization approach allows BLISS to predict influence scores for training samples, which are then used to select high-quality data. In experiments with Pythia and LLaMA models, BLISS reached target performance 1.7x faster than state-of-the-art methods.
AI IMPACT: Enables faster and more efficient pretraining of large language models by optimizing data selection.
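To make the final selection step concrete, here is a minimal Python sketch of choosing samples once influence scores have been predicted. It is not the BLISS algorithm: the bilevel training of the proxy and score models is not shown, and the function name `select_top_fraction`, the `fraction` parameter, and the example scores are all hypothetical, introduced only for illustration.

```python
from typing import List, Sequence


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the highest-scoring samples, best first.

    `scores` stands in for influence scores predicted by a score model
    (an assumption of this sketch; the summary does not specify how the
    scores are turned into a selection).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if len(scores) == 0:
        return []
    # Keep at least one sample so a small fraction never yields an empty set.
    keep = max(1, int(len(scores) * fraction))
    # Rank sample indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:keep]


# Made-up scores for six training samples (illustrative only).
predicted_scores = [0.12, 0.87, 0.45, 0.91, 0.05, 0.60]
selected = select_top_fraction(predicted_scores, fraction=0.5)
print(selected)  # [3, 1, 5]
```

Keeping the top half of six samples returns the three indices with the highest scores; a real pipeline would apply the same idea to a much larger corpus and a fraction chosen to fit the training budget.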