Researchers have introduced TexOCR, a benchmark and training corpus for improving Optical Character Recognition (OCR) models that reconstruct scientific documents as compilable LaTeX. Current OCR systems often fail to preserve essential structural elements and LaTeX-specific features, which leads to compilation errors. TexOCR-Bench evaluates transcription accuracy, structural integrity, and compilability, while TexOCR-Train provides a large training dataset. In experiments with a 2B-parameter model, reinforcement learning with verifiable rewards significantly improved structural and compilation metrics compared with supervised fine-tuning alone.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Improves LaTeX reconstruction from PDFs, potentially aiding scientific publishing workflows.
RANK_REASON Academic paper introducing a new benchmark and training corpus for a specific NLP task.