CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia
Researchers have introduced CzechDocs, a dataset designed to evaluate machine translation systems that preserve document formatting. The dataset includes parallel documents in Czech and several minority languages such as Ukrainian, English, Vietnamese, and Russian, presented in HTML, DOCX, and PDF formats. A portion of the dataset and an evaluation toolkit have been released to facilitate research into format-preserving machine translation.
IMPACT: Facilitates research into machine translation systems that maintain document formatting, particularly for minority languages.