Brief · PulseAugur

TOOL · arXiv cs.LG English(EN) · 9h

BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

Researchers have introduced BLISS, a lightweight method for selecting data to pretrain large language models more efficiently. Unlike prior data selection methods, BLISS does not rely on external pretrained models, and it accounts for the long-term impact of training data by using a proxy model together with a score model. This bilevel optimization setup lets BLISS predict influence scores for training samples, which are then used to select high-quality data. In experiments with Pythia and LLaMA models, BLISS reached target performance 1.7x faster than state-of-the-art methods.
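
To make the final selection step concrete, here is a minimal Python sketch of keeping the samples with the highest predicted influence scores. It is an illustration only, not the BLISS implementation: the function name `select_top_fraction`, the `keep_fraction` parameter, and the example scores are all hypothetical, and the brief does not say how BLISS thresholds or samples from its scores. The bilevel training of the proxy and score models is not shown.

```python
from typing import List, Sequence


def select_top_fraction(scores: Sequence[float], keep_fraction: float) -> List[int]:
    """Return indices of the highest-scoring samples, highest score first.

    Assumption: a higher score means a more useful training sample.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Keep at least one sample whenever any scores are provided.
    n_keep = max(1, int(len(scores) * keep_fraction))
    # Rank sample indices by score, descending; the sort is stable for ties.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_keep]


# Hypothetical influence scores, standing in for a score model's predictions.
predicted_scores = [0.12, 0.87, 0.45, 0.91, 0.05]
selected = select_top_fraction(predicted_scores, keep_fraction=0.4)
print(selected)  # indices of the two highest-scoring samples
```

In a real pipeline the scores would come from the trained score model rather than a hard-coded list, and the kept fraction would be set by the training budget.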

IMPACT Enables faster and more efficient pretraining of large language models by optimizing data selection.

LLM
LLaMA
Pythia
BLISS