PulseAugur

TexOCR model reconstructs scientific PDFs into compilable LaTeX

Researchers have introduced TexOCR, a benchmark and training corpus for improving optical character recognition (OCR) models that reconstruct scientific documents as compilable LaTeX. Current OCR systems often fail to preserve structural elements and LaTeX-specific features, which leads to compilation errors. TexOCR-Bench evaluates transcription accuracy, structural integrity, and compilability, while TexOCR-Train provides a large training dataset. In experiments with a 2B-parameter model, reinforcement learning with verifiable rewards significantly improved structural and compilation metrics over supervised fine-tuning alone.

Impact: Improves LaTeX reconstruction from PDFs, potentially aiding scientific publishing workflows.

Ranking rationale: Academic paper introducing a new benchmark and training corpus for a specific NLP task.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

Sources [1]

  1. arXiv cs.CL · TIER_1 · English (EN) · Chengye Wang, Lin Fu, Zexi Kuang, Yilun Zhao

    TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

    arXiv:2604.22880v1. Abstract: Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compil…