PulseAugur

New study benchmarks machine transliteration models for the Tajik-Farsi language pair

This paper introduces a benchmark for machine transliteration between Tajik (Cyrillic script) and Farsi (Arabic script), built on a new parallel corpus drawn from diverse sources. The study compares six model architectures, including rule-based systems, LSTMs, Transformers, and pre-trained multilingual models. The results show that byte-level and character-level models, particularly ByT5, significantly outperform subword-based models such as mT5 for this language pair.

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Highlights the effectiveness of byte/character-level models over subword tokenization for specific transliteration tasks.
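
To make the byte-level versus character-level distinction concrete, here is a minimal sketch of how the same word looks to each kind of model. The word pair (Tajik "китоб" and its Persian-script counterpart, both meaning "book") and the helper names `char_tokens` and `byte_tokens` are illustrative assumptions, not taken from the paper, and the sketch does not reproduce any model's actual tokenizer.

```python
# Illustrative only: the example words and helper names are assumptions,
# not material from the benchmarked paper.

def char_tokens(text: str) -> list[str]:
    """Character-level view: one token per Unicode code point."""
    return list(text)


def byte_tokens(text: str) -> list[int]:
    """Byte-level view: one token per UTF-8 byte (values 0-255)."""
    return list(text.encode("utf-8"))


# Tajik (Cyrillic script) word for "book".
tajik = "китоб"
# Persian (Arabic script) counterpart, written with escapes to pin the
# exact code points: KEHEH, TEH, ALEF, BEH.
persian = "\u06a9\u062a\u0627\u0628"

for label, word in (("Tajik", tajik), ("Persian", persian)):
    chars = char_tokens(word)
    raw = byte_tokens(word)
    print(f"{label}: {len(chars)} characters, {len(raw)} bytes")
    print("  bytes:", raw)
```

Both scripts fall in the two-byte range of UTF-8, so a byte-level model sees sequences twice as long as a character-level one, but neither depends on a learned subword vocabulary. The Tajik spelling also carries a vowel letter that the Persian spelling omits, which hints at why a one-to-one character mapping is not enough for this pair.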

RANK_REASON This is a research paper presenting a new benchmark and comparative study of machine learning models for a specific NLP task.


COVERAGE [2]

  1. arXiv cs.CL TIER_1 · Mullosharaf K. Arabov ·

    A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

    arXiv:2605.02270v1. Abstract: This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and valida…

  2. arXiv cs.CL TIER_1 · Mullosharaf K. Arabov ·

    A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

    This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from…