TOOL · arXiv cs.CL English(EN) · 21h

MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models

Researchers have developed MUDIDI, a two-stage framework for digitizing multilingual dictionaries, particularly those for low-resource languages. The framework addresses challenges such as varied scripts, complex layouts, and the preservation of lexicographic structure. The first stage assesses character recognition and markup preservation; the second segments dictionary entries into a machine-readable format. In the reported experiments, large language models (LLMs) outperform traditional OCR systems and vision-language models on this task, and performance improves further when the models are given additional context such as the dictionary's introduction.
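
To make the second stage more concrete, here is a minimal Python sketch of what turning a marked-up entry into a machine-readable record might look like. It is not MUDIDI's method (the framework uses language models for this step); the markup convention (`**headword**`, `*part of speech*`), the sample entry, and the `segment_entry` helper are all hypothetical and serve only to illustrate the kind of output such a stage could produce.

```python
import re
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical markup convention (an assumption, not taken from the paper):
# **bold** marks the headword, *italic* marks the part of speech,
# and the remaining text is the gloss.
ENTRY_PATTERN = re.compile(
    r"^\*\*(?P<headword>.+?)\*\*\s+\*(?P<pos>.+?)\*\s+(?P<gloss>.+)$"
)


@dataclass
class Entry:
    headword: str
    pos: str
    gloss: str


def segment_entry(line: str) -> Optional[Entry]:
    """Split one marked-up dictionary line into fields.

    Returns None when the line does not follow the assumed convention.
    """
    match = ENTRY_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = {name: value.strip() for name, value in match.groupdict().items()}
    return Entry(**fields)


sample = "**kala** *n.* fish"  # illustrative entry, not from the paper's data
record = segment_entry(sample)
print(asdict(record) if record else None)
```

A rule like this only works when every entry follows one fixed convention, which real dictionaries with varied scripts and layouts rarely do. It also depends on the first stage having preserved the markup, since typography is often the only signal separating a headword from its gloss.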

IMPACT This framework could significantly improve access to linguistic resources for endangered languages by enabling better digitization of dictionaries.

LLMs
vision-language models
language models
OCR systems
Ekaterina Vylomova