New ForMaT dataset targets visually-grounded PDF translation

By PulseAugur Editorial · [1 source] · 2026-05-15 09:50

Researchers have introduced ForMaT, a new dataset designed to improve visually-grounded multilingual PDF translation. The dataset comprises 3,956 PDFs across 15 language pairs and preserves the original layout metadata, capturing complex elements such as tables and formulas. Current machine translation systems struggle to maintain the link between text and its visual context, which points to the need for layout-aware models that integrate visual and textual information for accurate document reconstruction.

IMPACT This dataset aims to improve machine translation systems' ability to handle complex document layouts, potentially leading to more accurate and context-aware translations of visually rich documents.

RANK_REASON The cluster describes the release of a new academic dataset for a specific NLP task.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Kamil Guttmann · 2026-05-15 09:50

ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural diversity in the dataset, we employ K-Medoids samplin…
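The abstract is cut off at the K-Medoids step, so the features and the exact variant the authors use are not known here. As a rough illustration of the general idea only, the sketch below picks k representative documents (medoids) from a pool so that structurally different documents are each represented. The per-document features (table count, formula count), the function name `k_medoids`, and the farthest-first initialisation are all assumptions made for this example, not details from the paper.

```python
import math


def k_medoids(points, k, max_iter=100):
    """Return sorted indices of k medoids chosen from `points`.

    Alternating (Voronoi-style) k-medoids with Euclidean distance.
    This is an illustrative sketch, not the procedure used for ForMaT.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    # Pairwise distances between all feature vectors.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Deterministic farthest-first initialisation (an assumption of this sketch):
    # start at index 0, then repeatedly add the point farthest from all medoids.
    medoids = [0]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(max(candidates, key=lambda i: min(dist[i][m] for m in medoids)))

    for _ in range(max_iter):
        # Assignment step: each medoid owns itself, every other point
        # joins the cluster of its nearest medoid.
        clusters = {m: [m] for m in medoids}
        for i in range(n):
            if i in clusters:
                continue
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)

        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids

    return sorted(medoids)


# Hypothetical layout features per PDF: (number of tables, number of formulas).
layout_features = [
    (0, 0), (1, 0), (0, 1),        # mostly plain text
    (10, 10), (11, 10), (10, 11),  # tables and formulas
    (20, 0), (21, 0), (20, 1),     # table-heavy
]

selected = k_medoids(layout_features, k=3)
print(selected)
```

Unlike k-means centroids, medoids are always actual members of the pool, which is why the method suits selecting real documents for a corpus.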
