PulseAugur

New method detects LLM pretraining data via black-box API access

Researchers have proposed MC-PDD, a method for detecting whether a specific dataset was used to pretrain a large language model, including black-box, closed-source models. Inspired by masked language modeling, the technique masks tokens in the candidate text and measures how accurately the model predicts them to infer whether the data was included. MC-PDD reportedly performs comparably to existing methods while requiring only standard API access, which makes it applicable to model auditing and copyright verification.
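The mask-and-score idea described above can be illustrated with a minimal sketch. This is not the paper's algorithm: the whitespace tokenization, the `[MASK]` placeholder, the hypothetical `predict_masked` callable (standing in for whatever black-box API call returns the model's guess), and the margin-based comparison against a reference corpus are all assumptions made for illustration.

```python
import random
from typing import Callable, Sequence

MASK = "[MASK]"  # assumed placeholder; the real prompt format is not given in the source


def masked_accuracy(
    docs: Sequence[str],
    predict_masked: Callable[[str], str],  # hypothetical wrapper around a black-box API
    mask_rate: float = 0.15,
    seed: int = 0,
) -> float:
    """Return the fraction of masked words the model restores exactly, over a corpus."""
    rng = random.Random(seed)
    hits = 0
    total = 0
    for doc in docs:
        words = doc.split()
        if not words:
            continue
        # Mask at least one word per document, one position at a time.
        k = max(1, int(len(words) * mask_rate))
        for i in rng.sample(range(len(words)), k):
            prompt = " ".join(words[:i] + [MASK] + words[i + 1:])
            guess = predict_masked(prompt).strip()
            hits += int(guess == words[i])
            total += 1
    return hits / total if total else 0.0


def looks_like_pretraining_data(
    candidate_acc: float, reference_acc: float, margin: float = 0.1
) -> bool:
    """Assumed decision rule: flag the corpus if it beats a known-unseen reference by a margin."""
    return candidate_acc - reference_acc > margin
```

In use, `predict_masked` would send the masked prompt to the model's API and return its single-word answer; scoring a whole corpus rather than a single document, and comparing it with text the model cannot have seen, is what makes the signal a corpus-level one in this sketch.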

IMPACT Enables auditing of LLM training data and verification of data copyright using only API access.

RANK_REASON The cluster contains a research paper detailing a new method for detecting pretraining data in LLMs.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Kaixin Lan, Mu You, Tao Fang, Binkai Ou, Lidia S. Chao, Derek F. Wong

    MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

    arXiv:2606.07996v1 (cross-listed). Abstract: Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets we…

  2. arXiv cs.CL TIER_1 English(EN) · Derek F. Wong

    MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

    Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical…