PulseAugur

Lombard language corpora suffer from data quality and bias issues

Researchers have audited text corpora for the Lombard language and found significant problems with data quality and representation. Although web-scraped data appears abundant, many datasets contain misidentified text, boilerplate, and non-linguistic noise. The analysis also found a severe bias toward Western Lombard varieties that marginalizes Eastern ones, pointing to a need for community-driven, variety-aware data curation rather than scraping for quantity alone.

IMPACT Highlights critical data quality and representation challenges for under-resourced languages, impacting NLP model development.

RANK_REASON The cluster contains an academic paper detailing research findings on language corpora.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Edoardo Signoroni, Pavel Rychlý ·

    "Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard

    arXiv:2606.06349v1. Abstract: Several of the world's languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several t…

  2. arXiv cs.CL TIER_1 English(EN) · Pavel Rychlý ·

    "Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard

    Several of the world's languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We condu…