PulseAugur

New Romanian speech corpus tackles demographic bias in parliamentary ASR

Researchers have developed a new dataset and framework for improving Romanian-accented speech recognition, specifically for parliamentary proceedings. The ROManian PARliamentary Speech Corpus (ROMPAR) includes 17.80 hours of Romanian and Moldavian parliamentary speech, with double annotations and labels for reconstructed word fragments. A multi-task adversarial training framework was implemented to ensure demographic invariance across age, gender, and dialect, along with an LLM-guided decoding strategy for morphological completion of truncated words. This approach significantly reduced word error rate and achieved a 96.6% F1-score in morphological reconstruction.
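For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the sample sentences are assumptions made here, and real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of three reference words.
score = word_error_rate("ședința este deschisă", "ședința e deschisă")
print(f"WER = {score:.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.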

Ranking rationale: The cluster contains an academic paper detailing a new dataset and framework for a specific NLP task.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.CL · TIER_1 · English (EN) · Andrei-Marius Avram, Aureliu-Valentin Antonie, Ștefan-Bogdan Badea, Andrei Florea, Robert-Nicolae Zaharoiu, Dumitru-Clementin Cercel

    ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition

    arXiv:2606.15984v1. Abstract: Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. This paper introduces the ROMania…