MixTeX system uses synthetic data for efficient LaTeX OCR

Researchers have developed MixTeX, a system for LaTeX Optical Character Recognition (OCR) that reduces the need for large real-world datasets. It is pretrained on synthetic data built by pairing grammatically correct Wikipedia text with LaTeX formulas, which avoids dependence on costly and limited real LaTeX sources. After this synthetic phase, the system needs only a small number of real samples for fine-tuning. According to the paper, it outperforms existing methods trained on extensive real datasets while requiring fewer computational resources and less human effort. The models and code are publicly available, and the approach supports low-resource languages, offering a more efficient way to convert scientific document images into editable LaTeX.
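As a rough illustration of the synthetic pairing idea described above, the sketch below mixes plain sentences with LaTeX formulas to produce LaTeX source strings. It is not the authors' pipeline: the sentence and formula pools, the function names `make_synthetic_sample` and `make_dataset`, and the `display_prob` mixing rule are all hypothetical. Rendering each string to an image, which an OCR training pair would presumably need, is left out.

```python
import random

# Hypothetical stand-ins: the summary only says Wikipedia text is paired with
# LaTeX formulas. These tiny pools and the mixing rule are assumptions.
SENTENCES = [
    "The function is continuous on the closed interval.",
    "Energy is conserved in an isolated system.",
    "The series converges for all real inputs.",
]
FORMULAS = [
    r"E = mc^2",
    r"\int_0^1 x^2 \, dx = \frac{1}{3}",
    r"\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}",
]


def make_synthetic_sample(rng, n_sentences=2, display_prob=0.5):
    """Build one LaTeX source string that mixes prose with formulas."""
    parts = []
    for _ in range(n_sentences):
        sentence = rng.choice(SENTENCES)
        formula = rng.choice(FORMULAS)
        if rng.random() < display_prob:
            # Display math on its own line after the sentence.
            parts.append(f"{sentence}\n\\[ {formula} \\]")
        else:
            # Inline math appended to the sentence.
            parts.append(f"{sentence} Here ${formula}$ holds.")
    return "\n".join(parts)


def make_dataset(size, seed=0):
    """Generate a reproducible list of synthetic LaTeX source strings."""
    rng = random.Random(seed)
    return [make_synthetic_sample(rng) for _ in range(size)]
```

Calling `make_dataset(1000)` would give a list of 1000 source strings. Seeding the generator keeps the synthetic set reproducible, which makes it easier to compare pretraining runs.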

IMPACT Reduces data requirements for scientific document conversion, potentially enabling broader language support and faster research dissemination.

RANK_REASON The cluster describes a new research paper detailing a novel method for LaTeX OCR, including its methodology, evaluation, and public availability of code and models.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CV TIER_1 English(EN) · Yuhan Xu, Yijun Zhao, Renqing Luo, Gary M. Weiss

    MixTeX: Data-Efficient LaTeX OCR via Synthetic Pretraining and Limited Fine-Tuning

    arXiv:2406.17148v3. Abstract: LaTeX OCR converts scientific document images into editable LaTeX code. Existing systems rely on large paired datasets, which are costly to collect and limited for low-resource languages. This paper presents MIXTEX, a data-effic…