PulseAugur / Brief

Brief

last 24h
[1/1] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

  1. MixTeX: Data-Efficient LaTeX OCR via Synthetic Pretraining and Limited Fine-Tuning

    Researchers have developed MixTeX, a LaTeX Optical Character Recognition (OCR) system designed to reduce reliance on large real-world datasets. The model is pretrained on synthetic data that pairs grammatically correct Wikipedia text with LaTeX formulas, which avoids the dependency on costly and scarce real LaTeX sources. After this synthetic phase, the system needs only a small number of real samples for fine-tuning. It is reported to outperform existing methods trained on extensive real datasets while requiring fewer computational resources and less human effort. The models and code are publicly available, and the approach supports low-resource languages, offering a more efficient way to convert scientific document images into editable LaTeX.

    IMPACT Reduces data requirements for scientific document conversion, potentially enabling broader language support and faster research dissemination.
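
The synthetic pretraining idea in the summary, pairing ordinary text with LaTeX formulas, might be sketched roughly as follows. This is an illustration only, not MixTeX's actual pipeline: the function names, the `formula_prob` parameter, and the sample sentences and formulas are all hypothetical, and the step that renders each string to an image for OCR training is left out.

```python
import random


def make_synthetic_sample(sentences, formulas, rng, formula_prob=0.5):
    """Interleave plain-text sentences with inline LaTeX formulas.

    Hypothetical helper: after each sentence, a randomly chosen formula
    is appended with probability `formula_prob`.
    """
    parts = []
    for sentence in sentences:
        parts.append(sentence)
        if formulas and rng.random() < formula_prob:
            parts.append("$" + rng.choice(formulas) + "$")
    return " ".join(parts)


def build_corpus(sentences, formulas, n_samples,
                 sentences_per_sample=2, formula_prob=0.5, seed=0):
    """Build a list of synthetic LaTeX source strings (assumed target labels)."""
    rng = random.Random(seed)  # seeded so the corpus is reproducible
    corpus = []
    for _ in range(n_samples):
        k = min(sentences_per_sample, len(sentences))
        chosen = rng.sample(sentences, k=k)
        corpus.append(make_synthetic_sample(chosen, formulas, rng, formula_prob))
    return corpus


if __name__ == "__main__":
    # Placeholder inputs; a real setup would draw text from a large corpus.
    text = [
        "The city lies on the northern bank of the river.",
        "The museum opened to the public in the spring.",
        "Its population grew steadily over the decade.",
    ]
    math = [r"E = mc^2", r"\frac{a}{b} + \sqrt{x}", r"\sum_{i=1}^{n} i"]
    for sample in build_corpus(text, math, n_samples=3):
        print(sample)
```

In a setup like this, each generated string would serve as the ground-truth label for an image rendered from it, so labeled pairs can be produced without collecting real LaTeX documents.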