PulseAugur

New HK-LegiCoST Corpus Aids Speech Translation Research

Researchers have introduced HK-LegiCoST, a new three-way parallel corpus for speech translation research. It contains over 600 hours of Cantonese audio with corresponding traditional Chinese transcripts and English translations, all aligned at the sentence level. A key challenge was aligning non-verbatim transcripts, which are common when the spoken and written forms of a language differ significantly. This makes the corpus suitable for research on languages with vernacular and dialectal speech variation. The authors use the corpus to demonstrate competitive speech translation baselines and cross-corpus results.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CL · English (EN) · Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, Sanjeev Khudanpur

    HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

    arXiv:2306.11252v2. Abstract: We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned a…