Researchers have developed MixTeX, a system for LaTeX Optical Character Recognition (OCR) that significantly reduces the need for large, real-world datasets. It is pretrained on synthetic data that pairs grammatically correct Wikipedia text with LaTeX formulas, which avoids depending on real LaTeX sources that are costly and in limited supply. After this synthetic phase, the system needs only a small number of real samples for fine-tuning. It outperforms existing methods trained on extensive real datasets while demanding fewer computational resources and less human effort. The models and code are publicly available, and the approach supports low-resource languages and offers a more efficient way to convert scientific document images into editable LaTeX.
IMPACT Reduces data requirements for scientific document conversion, potentially enabling broader language support and faster research dissemination.
RANK_REASON The cluster describes a new research paper detailing a novel method for LaTeX OCR, including its methodology, evaluation, and public availability of code and models.
AI-generated summary · Google Gemini · from 1 source.