MixTeX: Data-Efficient LaTeX OCR via Synthetic Pretraining and Limited Fine-Tuning
Researchers have developed MixTeX, a LaTeX optical character recognition (OCR) system that significantly reduces the need for large real-world datasets. MixTeX is pretrained on synthetic data that pairs grammatically correct Wikipedia text with LaTeX formulas, which avoids dependence on costly and scarce real LaTeX sources. After this synthetic phase, the system needs only a small number of real samples for fine-tuning, and it outperforms existing methods trained on extensive real datasets while requiring fewer computational resources and less human effort. The models and code are publicly available. The approach supports low-resource languages and offers a more efficient way to convert scientific document images into editable LaTeX.
IMPACT: Reduces data requirements for scientific document conversion, potentially enabling broader language support and faster research dissemination.
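
To make the synthetic pretraining idea concrete, here is a minimal sketch of how prose sentences and LaTeX formulas might be interleaved into a single target string. It is an illustration only, not the authors' pipeline: the function name `make_synthetic_sample`, the `formula_prob` parameter, and the choice between inline and display math are all assumptions, and the step that renders the string to an image is omitted.

```python
import random


def make_synthetic_sample(sentences, formulas, rng, formula_prob=0.5):
    """Interleave prose sentences with LaTeX formulas (hypothetical helper).

    sentences: list of plain-text sentences (e.g. drawn from Wikipedia).
    formulas: list of LaTeX formula bodies without math delimiters.
    rng: a random.Random instance, so results are reproducible.
    formula_prob: assumed probability of adding a formula after a sentence.
    """
    parts = []
    for sentence in sentences:
        parts.append(sentence)
        if formulas and rng.random() < formula_prob:
            formula = rng.choice(formulas)
            # Assumption: pick inline or display math with equal probability.
            if rng.random() < 0.5:
                parts.append(f"${formula}$")
            else:
                parts.append(f"\\[{formula}\\]")
    return " ".join(parts)


if __name__ == "__main__":
    rng = random.Random(0)
    text = ["Energy is conserved.", "The series converges."]
    math = [r"E = mc^2", r"\sum_{n=1}^{\infty} \frac{1}{n^2}"]
    print(make_synthetic_sample(text, math, rng))
```

In a full pipeline, each generated string would be compiled to an image, and the image and string would form one training pair. Real samples would then be needed only for the fine-tuning stage.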