PulseAugur
Infini-News offers fast search for 1.3B news articles

Researchers have developed Infini-News, a toolkit and index that provides efficient access to more than 1.3 billion news articles from the Common Crawl archive. Each article comes with cleaned text, structured metadata, language detection, and geographic attribution. The system uses Infini-gram indexes, which let researchers search the entire archive for text patterns in under a second and so support large-scale media research.
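To give a feel for why an index can answer pattern queries without scanning the whole corpus, here is a minimal sketch of counting a phrase with a suffix array, the data structure that infini-gram-style indexes are built on. It is a toy under stated assumptions, not the Infini-News toolkit or its API: the function names (`build_suffix_array`, `count_occurrences`), the whitespace tokenization, and the tiny sample corpus are all hypothetical and chosen only for illustration.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction that compares whole suffixes; fine for a toy corpus,
    far too slow for a real archive.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, suffix_array, pattern, upper):
    """Binary search over suffixes truncated to len(pattern).

    upper=False: first index whose truncated suffix is >= pattern.
    upper=True:  first index whose truncated suffix is >  pattern.
    """
    k = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = tokens[start:start + k]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(tokens, suffix_array, pattern):
    """Count occurrences of `pattern` (a list of tokens) in `tokens`.

    All suffixes that start with the pattern sit in one contiguous run of
    the suffix array, so two binary searches give the count.
    """
    if not pattern:
        return 0
    first = _bound(tokens, suffix_array, pattern, upper=False)
    last = _bound(tokens, suffix_array, pattern, upper=True)
    return last - first


# Hypothetical toy corpus, tokenized on whitespace (an assumption).
tokens = "the cat sat on the mat the cat ran".split()
suffix_array = build_suffix_array(tokens)
phrase_count = count_occurrences(tokens, suffix_array, ["the", "cat"])
```

Each query in this sketch needs only a logarithmic number of comparisons in the corpus size, which is the general reason a prebuilt index can stay fast as the archive grows. A production index would also need an efficient construction algorithm and on-disk storage, which this sketch leaves out.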

Impact: Lowers the barrier for computational social science and NLP research by providing efficient access to a massive news corpus.

Ranking rationale: Publication of an academic paper detailing a new toolkit and dataset for NLP research.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source. How we write our summaries →

Sources [1]

  1. arXiv cs.CL TIER_1 English (EN) · Kirill Solovev

    Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

    Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale s…