Researchers have introduced PureDocBench, a new benchmark for document parsing that addresses shortcomings of the existing OmniDocBench dataset, which suffers from annotation errors and potential contamination. PureDocBench is programmatically generated and source-traceable, enabling more reliable evaluation across clean, digitally degraded, and real-world document settings. Initial evaluations of 40 models show that document parsing is far from solved, with significant performance gaps between models and a shared bottleneck in formula recognition.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT PureDocBench provides a more reliable evaluation for document parsing models, highlighting current limitations and guiding future research.
RANK_REASON The cluster describes a new benchmark for evaluating document parsing models, along with findings from its initial application.