New ForMaT dataset targets visually-grounded PDF translation

By PulseAugur Editorial · [1 source] · 2026-05-15 09:50

Researchers have introduced ForMaT, a dataset for visually-grounded multilingual PDF translation. It comprises 3,956 PDFs across 15 language pairs and preserves the original layout metadata, so that complex elements such as tables and formulas are captured. Current machine translation systems show significant weaknesses in maintaining the link between text and its visual context, which points to the need for layout-aware models that integrate visual and textual information to reconstruct documents accurately.

Impact: The dataset aims to improve how machine translation systems handle complex document layouts, which could lead to more accurate, context-aware translations of visually rich documents.

Ranking rationale: The cluster describes the release of a new academic dataset for a specific NLP task.

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL TIER_1 English(EN) · Kamil Guttmann · 2026-05-15 09:50

ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata proposed for multimodal machine translation. To ensure structural diversity in the dataset, we employ K-Medoids samplin…
