New study benchmarks machine transliteration models for the Tajik-Farsi language pair

By PulseAugur Editorial · [2 sources] · 2026-05-04 06:24

This paper introduces a benchmark for machine transliteration between Tajik (Cyrillic script) and Farsi (Arabic script), built on a parallel corpus compiled from diverse sources. The study compares six model architectures, including rule-based systems, LSTMs, Transformers, and pre-trained multilingual models. Results show that byte-level and character-level models, particularly ByT5, significantly outperform subword-based models such as mT5 for this language pair.

IMPACT Highlights that byte-level and character-level models can outperform subword-tokenized models on transliteration tasks such as Tajik-Farsi.
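To make the byte-level versus character-level distinction concrete, here is a minimal Python sketch of how the same word looks to each kind of model. It is an illustration only, not the paper's preprocessing: the example word pair and the helper names `byte_tokens` and `char_tokens` are our own assumptions, and no subword tokenizer is shown because its splits depend on a learned vocabulary.

```python
def byte_tokens(text: str) -> list[int]:
    """Return the UTF-8 byte values of text (the raw units a byte-level model reads)."""
    return list(text.encode("utf-8"))


def char_tokens(text: str) -> list[str]:
    """Return the Unicode code points of text (the units a character-level model reads)."""
    return list(text)


# Illustrative word pair chosen for this sketch, not taken from the paper's corpus.
tajik = "китоб"    # "book" in Tajik (Cyrillic script)
persian = "کتاب"   # the same word in Persian (Arabic script)

for word in (tajik, persian):
    # Print the word, its length in characters, and its length in UTF-8 bytes.
    print(word, len(char_tokens(word)), len(byte_tokens(word)))
```

Both scripts encode each of these letters as two UTF-8 bytes, so a byte-level model sees sequences roughly twice as long as a character-level one, but neither depends on a fixed subword vocabulary that may split Cyrillic or Arabic-script words unevenly.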

RANK_REASON This is a research paper presenting a new benchmark and comparative study of machine learning models for a specific NLP task.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.CL TIER_1 English(EN) · Mullosharaf K. Arabov · 2026-05-05 04:00

A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

arXiv:2605.02270v1. This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and valida…

arXiv cs.CL TIER_1 English(EN) · Mullosharaf K. Arabov · 2026-05-04 06:24

A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from…
