Brief · PulseAugur

TOOL · arXiv cs.AI (TL) · 2w

Combating Data Laundering in LLM Training

A new research paper introduces Synthesis Data Reversion (SDR), a method for combating data laundering in Large Language Model (LLM) training. Data laundering means transforming proprietary data to obscure its origin before training on it, which makes unauthorized use hard for rights owners to detect. SDR infers the unknown laundering transformation and synthesizes queries that mimic the laundered data, which strengthens the signal that misuse detectors rely on. According to the paper, the approach consistently improves data misuse detection across several LLM families and laundering practices in evaluations on the MIMIR benchmark.
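The idea of querying with laundered-style text can be illustrated with a toy sketch. This is not the paper's implementation: the trigram-overlap "detector", the word-substitution "laundering", the candidate list, and every function name below are assumptions made purely for illustration. The sketch only shows why a query that mimics the laundered form can score higher than the owner's original text.

```python
from typing import Callable, Dict, Set, Tuple

Trigram = Tuple[str, ...]


def trigrams(text: str) -> Set[Trigram]:
    """Return the set of lowercase word trigrams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def detection_score(query: str, memorized: Set[Trigram]) -> float:
    """Toy detection signal: fraction of query trigrams the 'model' has seen."""
    grams = trigrams(query)
    if not grams:
        return 0.0
    return len(grams & memorized) / len(grams)


def make_substituter(mapping: Dict[str, str]) -> Callable[[str], str]:
    """Build a hypothetical laundering transformation that swaps words."""
    def apply(text: str) -> str:
        return " ".join(mapping.get(w, w) for w in text.lower().split())
    return apply


# Hypothetical candidate transformations the rights owner considers.
CANDIDATES: Dict[str, Callable[[str], str]] = {
    "identity": make_substituter({}),
    "formal": make_substituter({"fox": "vulpine", "dog": "canine"}),
    "synonyms": make_substituter({"quick": "fast", "jumps": "leaps", "lazy": "idle"}),
}


def infer_transformation(original: str, memorized: Set[Trigram]) -> str:
    """Pick the candidate whose synthesized query gives the strongest signal."""
    return max(
        CANDIDATES,
        key=lambda name: detection_score(CANDIDATES[name](original), memorized),
    )


original = "the quick brown fox jumps over the lazy dog"

# The 'model' was trained on a laundered copy, not on the original text.
memorized = trigrams(CANDIDATES["synonyms"](original))

direct_score = detection_score(original, memorized)
best_name = infer_transformation(original, memorized)
reverted_score = detection_score(CANDIDATES[best_name](original), memorized)

print(f"direct query score:      {direct_score:.2f}")
print(f"inferred transformation: {best_name}")
print(f"synthesized query score: {reverted_score:.2f}")
```

In this toy setting the original text shares no trigrams with the laundered copy, so a direct query yields no signal, while the query synthesized with the inferred transformation matches it fully. A real detector would score queries against an actual LLM rather than a set of memorized trigrams, and the transformation would not be chosen from a known list.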

IMPACT This research offers a novel defense against data laundering, potentially protecting intellectual property in AI training data.

LLM
Falcon
Pythia
MIMIR benchmark
Muxing Li
Synthesis Data Reversion