DocAtlas: Multilingual Document Understanding Across 80+ Languages
Researchers have introduced DocAtlas, a framework for improving multilingual document understanding, particularly in low-resource languages. It builds high-fidelity OCR datasets and benchmarks covering 82 languages through two pipelines: one based on DOCX documents and one that generates synthetic LaTeX. An evaluation of 16 state-of-the-art models found persistent performance gaps on low-resource scripts. The authors also report that Direct Preference Optimization (DPO) with rendering-derived ground truth can stably adapt models to multiple languages, improving accuracy without degrading base-language performance.
AI IMPACT: Enhances AI's ability to process and understand documents in a wider range of languages, potentially improving global information access and cross-lingual AI applications.
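For readers unfamiliar with DPO, the standard per-pair objective can be sketched in a few lines. This is a minimal illustration of the general DPO loss, not DocAtlas's training code. The function name `dpo_loss`, the default `beta`, and the reading that the "chosen" response is the rendering-derived transcription while the "rejected" response is a flawed model output are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    Assumption: "chosen" is the rendering-derived ground-truth transcription
    and "rejected" is a less accurate model output.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable way.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy matches the reference, the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Shifting probability toward the chosen response lowers the loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Because the loss depends only on how far the policy moves relative to the frozen reference model, the reference acts as an anchor. That is the usual explanation for why DPO can adapt a model without drifting far from its original behavior.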