PulseAugur

New method identifies LLM web scrapers using unique tokens

Researchers have developed a method for automatically identifying which web scrapers feed data to which large language models (LLMs). The technique hosts dynamic websites that serve a unique "canary token" to each visiting scraper. The researchers then prompt LLMs and check whether the outputs consistently contain one of these tokens; a consistent match indicates that the corresponding scraper supplies data to that LLM. Experiments across 22 production LLM systems showed that the approach reliably identifies previously unknown scraper-LLM connections, giving unprivileged third parties insight into data sourcing and a potential means of controlling unwanted scraping.
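The token-per-scraper idea can be illustrated with a minimal Python sketch. This is not the paper's implementation: identifying a scraper by a string such as its User-Agent, deriving the token with an HMAC, the `SECRET_KEY` value, the `cnry` prefix, and the 50% consistency threshold are all assumptions made here for illustration.

```python
import hashlib
import hmac

# Hypothetical secret held by the site owner; it keeps tokens unguessable.
SECRET_KEY = b"replace-with-a-private-key"


def canary_token(scraper_id: str, length: int = 12) -> str:
    """Derive a stable token that is unique to one scraper identity.

    Assumption: scraper_id is something like the visitor's User-Agent string.
    """
    digest = hmac.new(SECRET_KEY, scraper_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "cnry" + digest[:length]


def render_page(scraper_id: str) -> str:
    """Build the page body served to this scraper, with its token embedded."""
    token = canary_token(scraper_id)
    return f"<html><body><p>Our product code is {token}.</p></body></html>"


def attribute_outputs(outputs, scraper_ids, min_fraction=0.5):
    """Return the scraper ids whose token appears in at least
    min_fraction of the LLM outputs collected for one prompt."""
    if not outputs:
        return []
    matches = []
    for sid in scraper_ids:
        token = canary_token(sid)
        hits = sum(token in out for out in outputs)
        if hits / len(outputs) >= min_fraction:
            matches.append(sid)
    return matches
```

In use, the site would call `render_page` with each visitor's identity, and an analyst would later pass repeated LLM responses to `attribute_outputs`. Requiring a token to recur across several responses, rather than appear once, mirrors the summary's emphasis on consistent generation.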

IMPACT Provides a method for identifying data sources for LLMs, potentially enabling better control over web scraping and data provenance.

RANK_REASON The cluster contains an academic paper detailing a new research method.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. arXiv cs.AI · English (EN) · Emily Wenger

    Identifying AI Web Scrapers Using Canary Tokens

    From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics …