New Marathi dataset BhashaSetu boosts low-resource translation quality

By PulseAugur Editorial · [2 sources] · 2026-05-26 14:03

Researchers have introduced BhashaSetu, an English-Marathi parallel dataset designed to improve low-resource machine translation for Marathi. It contains 2.78 million sentence pairs across various domains and includes stemmed and lemmatized representations for morphology-aware analysis. Experiments show that corpus-level deduplication significantly boosts translation quality, highlighting the importance of data hygiene for morphologically rich languages. The dataset is publicly available to support reproducible research in this area.
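To make the deduplication idea concrete, here is a minimal Python sketch of exact-match deduplication over sentence pairs. It is an illustration under stated assumptions, not the paper's procedure: the summary does not describe how BhashaSetu was deduplicated, and the `normalize` and `deduplicate_pairs` helpers, the normalization rules, and the sample sentences are all hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    # Assumed normalization: NFC-normalize so equivalent Unicode sequences
    # compare equal, lowercase (affects the English side only), and
    # collapse runs of whitespace.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())


def deduplicate_pairs(pairs):
    """Keep the first occurrence of each normalized (source, target) pair."""
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (normalize(src), normalize(tgt))
        if key in seen:
            continue  # same pair already kept, skip the repeat
        seen.add(key)
        unique.append((src, tgt))
    return unique


# Hypothetical sample data, not taken from BhashaSetu.
corpus = [
    ("Hello world", "नमस्कार जग"),
    ("hello   world", "नमस्कार जग"),  # duplicate after normalization
    ("Good morning", "शुभ प्रभात"),
]
deduped = deduplicate_pairs(corpus)
print(len(corpus), "->", len(deduped))
```

Keying on the normalized pair rather than the raw strings means trivial variants (case, spacing) count as duplicates, while the original text of the first occurrence is preserved in the output.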

IMPACT This dataset and the findings on data hygiene could significantly improve translation quality for underrepresented languages.



AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CL TIER_1 English(EN) · Param Thakkar, Anushka Yadav, Michael Tiemann, Abhi Mehta, Akshita Bhasin, Shrinivas Khedkar · 2026-05-27 04:00

BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

arXiv:2605.27050v1 Abstract: We present BhashaSetu, a linguistically enriched English–Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepr…

arXiv cs.CL TIER_1 English(EN) · Shrinivas Khedkar · 2026-05-26 14:03

BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

We present BhashaSetu, a linguistically enriched English–Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across …
