PulseAugur

New dataset released for format-preserving machine translation in Czechia

Researchers have introduced CzechDocs, a multiway parallel dataset for evaluating machine translation systems that preserve document formatting. It contains parallel documents in HTML, DOCX, and PDF formats covering Czech and minority languages used in Czechia: primarily Ukrainian and English, with smaller portions of Vietnamese, Russian, and other languages. A portion of the dataset and an evaluation toolkit have been released to support research into format-preserving machine translation.

IMPACT Facilitates research into machine translation systems that maintain document formatting, particularly for minority languages.

RANK_REASON The item describes a new dataset released for research purposes, fitting the 'research' bucket.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CL TIER_1 English(EN) · Ondřej Bojar

    CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia

    We present CzechDocs, a multiway parallel dataset of formatted documents (HTML, DOCX, and PDF) covering Czech and minority languages used in Czechia, primarily Ukrainian and English, with smaller portions of Vietnamese, Russian and other languages. The dataset is designed to suppo…