WCXB: A Multi-Type Web Content Extraction Benchmark
Researchers have introduced the Web Content Extraction Benchmark (WCXB), a new dataset designed to improve the evaluation of systems that isolate main content from web pages. The WCXB dataset comprises 2,008 web pages from 1,613 domains, covering seven distinct page types beyond just news articles. Evaluations on this benchmark revealed significant performance disparities among extraction systems, particularly on structured page types, highlighting limitations of existing article-centric benchmarks.
AI IMPACT: Provides a more comprehensive evaluation of web content extraction systems, which are crucial for LLM training and RAG.