Brief · PulseAugur

TOOL · arXiv cs.AI · 1d

Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Researchers have proposed Gap-K%, a method for detecting whether a given text was part of a large language model's pretraining data. The technique measures the gap between the model's top-1 prediction and the actual target token, on the reasoning that such mismatches produce gradient signals that training penalizes, so text the model has seen should show smaller gaps. By also incorporating local correlations between neighboring tokens, Gap-K% is reported to significantly outperform existing methods on the WikiMIA and MIMIR benchmarks, offering a more robust way to identify training data.
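To make the idea concrete, here is a minimal sketch of a gap-based membership score. It is an illustration under stated assumptions, not the authors' implementation: the function name `gap_k_score`, the parameters `k` and `window`, the moving-average smoothing used as a stand-in for "local token correlations", and the choice to average the largest K% of smoothed gaps are all hypothetical, and the log-probabilities are synthetic rather than produced by a real model.

```python
import numpy as np

def gap_k_score(log_probs, target_ids, k=0.2, window=3):
    """Hypothetical gap-based score: higher means 'more likely seen in training'.

    log_probs:  (seq_len, vocab) array of next-token log-probabilities.
    target_ids: (seq_len,) array with the token that actually came next.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    target_ids = np.asarray(target_ids)

    # Gap between the top-1 prediction and the true token (always >= 0).
    top1 = log_probs.max(axis=1)
    target = log_probs[np.arange(len(target_ids)), target_ids]
    gaps = top1 - target

    # Assumption: smooth over neighboring tokens with a moving average.
    w = min(window, len(gaps))
    smoothed = np.convolve(gaps, np.ones(w) / w, mode="valid")

    # Assumption: average the K% largest smoothed gaps, then negate.
    n = max(1, int(np.ceil(k * len(smoothed))))
    worst = np.sort(smoothed)[-n:]
    return -float(worst.mean())

# Synthetic example: 5 positions, vocabulary of 4 tokens.
targets = np.array([0, 1, 2, 3, 1])
rows = np.arange(5)

# Case 1: the model's top-1 prediction is always the true token.
confident = np.log(np.full((5, 4), 0.1))
confident[rows, targets] = np.log(0.7)

# Case 2: the model always prefers a different token.
surprised = np.log(np.full((5, 4), 0.1))
surprised[rows, (targets + 1) % 4] = np.log(0.7)

score_confident = gap_k_score(confident, targets)
score_surprised = gap_k_score(surprised, targets)
```

In this toy setup the text the model predicts correctly gets the higher score, which is the direction a membership detector would threshold on. In practice the log-probabilities would come from the model under audit, and the threshold would be calibrated on known member and non-member text.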

IMPACT Enhances transparency and accountability in LLM development by providing a tool to identify training data sources.

Large Language Models
MIMIR
Minseo Kwak
Gap-K%
WikiMIA