Common Crawl
PulseAugur coverage of Common Crawl — every cluster mentioning Common Crawl across labs, papers, and developer communities, ranked by signal.
3 days with sentiment data
-
Infini-News offers fast search for 1.3B news articles
Researchers have developed Infini-News, a toolkit and index designed to provide efficient access to over 1.3 billion news articles from the Common Crawl archive. This new resource includes cleaned text, structured metad…
-
FutureSim benchmark tests AI forecasting with historical data
Researchers from the Max Planck Institute have introduced FutureSim, a new benchmark designed to evaluate AI agents' ability to predict real-world events using only historical web data. This method prevents agents from …
-
Elsevier sues Meta over AI training data, citing copyright infringement
Academic publishing giant Elsevier, along with other publishers and authors, has filed a lawsuit against Meta, accusing the company of illegally scraping and using copyrighted research papers to train its Llama large la…
-
LLM-generated content is rapidly growing on the web, study finds
A new research paper introduces DeGenTWeb, a system designed to systematically identify websites dominated by content generated by large language models (LLMs) with minimal human oversight. The study found that LLM-domi…
-
News publishers demand Common Crawl block AI training on their content
News publishers are demanding that Common Crawl cease its unauthorized scraping of web content and prevent AI companies from using this data for model training. The News/Media Alliance has formally communicated this dem…
-
Google warns of increasing, unsophisticated AI prompt injection attacks
Google Threat Intelligence researchers have identified an increase in indirect prompt injection attacks targeting AI systems that browse the web. While many of these attacks are currently low in sophistication and harml…
-
Interactive guide explains how large language models like ChatGPT are built
A new interactive visual guide, based on Andrej Karpathy's lecture, explains the intricate process of building large language models. It details the journey from collecting vast amounts of internet text to the final sta…
-
Researchers unveil PermaFrost-Attack for latent LLM poisoning during pretraining
Researchers have introduced PermaFrost-Attack, a novel method for embedding hidden vulnerabilities, termed 'logic landmines,' into large language models during their pretraining phase. This attack, known as Stealth Pret…