PulseAugur

New Marathi dataset BhashaSetu boosts low-resource translation quality

Researchers have introduced BhashaSetu, a linguistically enriched English–Marathi parallel dataset aimed at low-resource machine translation. It contains 2.78 million sentence pairs across multiple domains and includes stemmed and lemmatized representations for morphology-aware analysis. Experiments show that corpus-level deduplication significantly improves translation quality, underscoring the importance of data hygiene for morphologically rich languages. The dataset is publicly available to support reproducible research in this area.
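
To make "corpus-level deduplication" concrete, here is a minimal Python sketch of one common form of it: dropping sentence pairs that repeat after light normalization. The summary does not describe the paper's actual procedure, so the normalization steps (Unicode NFC, case folding, whitespace collapsing) and the names `normalize` and `deduplicate_pairs` are assumptions chosen for illustration, and the sample pairs are made up.

```python
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: Unicode NFC, case folding, collapsed whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.casefold().split())


def deduplicate_pairs(pairs):
    """Keep the first occurrence of each (source, target) pair.

    Two pairs count as duplicates when both sides match after
    normalization. This is exact-match deduplication only; it does
    not catch near-duplicates or paraphrases.
    """
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (normalize(src), normalize(tgt))
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique


# Illustrative English-Marathi pairs (not taken from the dataset).
pairs = [
    ("Hello world", "नमस्कार जग"),
    ("hello   world", "नमस्कार जग"),  # duplicate after normalization
    ("Good morning", "शुभ प्रभात"),
]
unique_pairs = deduplicate_pairs(pairs)
print(len(pairs), "->", len(unique_pairs))
```

Keying on both sides of the pair keeps sentences that share a source but have different translations; keying on the source alone would be a stricter design choice.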

IMPACT This dataset and the findings on data hygiene could significantly improve translation quality for underrepresented languages.

RANK_REASON The cluster describes a new research paper introducing a dataset and methodology for low-resource machine translation.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Param Thakkar, Anushka Yadav, Michael Tiemann, Abhi Mehta, Akshita Bhasin, Shrinivas Khedkar

    BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

    arXiv:2605.27050v1 · Abstract: We present BhashaSetu, a linguistically enriched English–Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepr…

  2. arXiv cs.CL TIER_1 English(EN) · Shrinivas Khedkar

    BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

    We present BhashaSetu, a linguistically enriched English–Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across …