TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

TexOCR models reconstruct scientific PDFs into compilable LaTeX

By the PulseAugur editorial team · [1 source] · 2026-04-28 04:00

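Researchers have introduced TexOCR, a new benchmark and training corpus aimed at improving optical character recognition (OCR) models that reconstruct scientific documents as compilable LaTeX. Current OCR systems often fail to preserve important structural elements and LaTeX-specific features, which leads to compilation errors. The accompanying TexOCR-Bench evaluates transcription accuracy, structural integrity, and compilability, while TexOCR-Train provides a large dataset for training. Experiments with a 2B-parameter model show that reinforcement learning with verifiable rewards significantly improves performance on structural and compilation metrics compared with supervised fine-tuning alone.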
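To make the idea of a "verifiable reward" concrete, here is a minimal Python sketch of a purely structural check on generated LaTeX. It is an illustration under stated assumptions, not the paper's actual reward or the TexOCR-Bench metric: the function names (`environments_balanced`, `braces_balanced`, `structural_reward`) and the equal weighting of the two checks are hypothetical, and the sketch ignores comments and verbatim content. A real compilability check would also need to invoke a LaTeX compiler.

```python
import re

# Matches \begin{name} and \end{name}; the name must not contain braces.
ENV_TOKEN = re.compile(r"\\(begin|end)\{([^{}]+)\}")


def environments_balanced(tex: str) -> bool:
    r"""Return True if every \begin{env} is closed by a matching \end{env}, in order."""
    stack = []
    for kind, name in ENV_TOKEN.findall(tex):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack.pop() != name:
            # An \end with no open environment, or with the wrong name.
            return False
    return not stack


def braces_balanced(tex: str) -> bool:
    """Return True if unescaped curly braces are balanced."""
    depth = 0
    i = 0
    while i < len(tex):
        ch = tex[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
        i += 1
    return depth == 0


def structural_reward(tex: str) -> float:
    """Hypothetical reward: fraction of structural checks passed (0.0 to 1.0)."""
    checks = [environments_balanced(tex), braces_balanced(tex)]
    return sum(checks) / len(checks)
```

Because each check can be verified automatically from the output alone, a score like this could in principle serve as one reward signal during reinforcement learning, alongside an actual compile step.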

Impact: Improves LaTeX reconstruction from PDFs, which may help scientific publishing workflows.

Ranking rationale: An academic paper introducing a new benchmark and training corpus for a specific NLP task.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL · Chengye Wang, Lin Fu, Zexi Kuang, Yilun Zhao · 2026-04-28 04:00

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

arXiv:2604.22880v1 Announce Type: new Abstract: Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compil…

