FalAR corpus boosts European Portuguese ASR with 5,800 hours of parliamentary data

By PulseAugur Editorial · [2 sources] · 2026-05-26 14:14

Researchers have introduced FalAR, a large-scale speech corpus of European Portuguese parliamentary sessions intended to improve Automatic Speech Recognition (ASR) for the language. The corpus contains approximately 5,800 hours of speech spanning 20 years, with speaker identity annotations for 1,180 individuals. Experiments show that pre-training on FalAR reduces Word Error Rate (WER) by up to 14%.
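For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. The sketch below is a minimal illustration, not the paper's evaluation code. The function names, the example sentences, and the baseline figure of 0.20 are hypothetical. It also assumes that the "up to 14%" figure is a relative reduction, which the summary does not specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical transcripts: one dropped word out of five reference words.
wer = word_error_rate("o deputado tem a palavra", "o deputado tem palavra")

# Made-up numbers: under the relative reading, a 14% reduction
# takes a baseline WER of 0.20 down to 0.172.
reduction = relative_reduction(0.20, 0.172)
```

Under the alternative absolute reading, the same baseline of 0.20 would fall to 0.06, so the distinction matters when comparing against other reported results.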

IMPACT The corpus targets the resource gap between European and Brazilian Portuguese, with the goal of significantly improving ASR performance for the European variety.

RANK_REASON The cluster contains a research paper detailing a new dataset for ASR.

paper
infra

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CL TIER_1 English(EN) · Francisco Teixeira, Carlos Carvalho, Mariana Julião, Catarina Botelho, Rubén Solera-Ureña, Sérgio Paulo, Thomas Rolland, Ben Peters, Isabel Trancoso, Alberto Abad · 2026-05-27 04:00

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

arXiv:2605.27062v1 Announce Type: new Abstract: State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented …
arXiv cs.CL TIER_1 English(EN) · Alberto Abad · 2026-05-26 14:14

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having…
