This paper introduces a new benchmark for machine transliteration between Tajik and Farsi, developing a unique parallel corpus from diverse sources. The study compares six model architectures, including rule-based systems, LSTMs, Transformers, and pre-trained multilingual models. Results show that byte-level and character-level models, particularly ByT5, significantly outperform subword-based models like mT5 for this language pair.
AI IMPACT Highlights the effectiveness of byte/character-level models over subword tokenization for specific transliteration tasks.
RANK_REASON This is a research paper presenting a new benchmark and comparative study of machine learning models for a specific NLP task.
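To make the byte-level versus character-level distinction concrete, here is a minimal Python sketch that splits one word in each script into characters and into raw UTF-8 bytes. The helper names `char_tokens` and `byte_tokens` and the example word pair are illustrative assumptions, not taken from the paper, and the byte sequence is plain UTF-8 rather than the output of the actual ByT5 tokenizer.

```python
# Illustrative only: hypothetical helpers, not the paper's preprocessing
# and not the real ByT5 tokenizer.

def char_tokens(text: str) -> list[str]:
    """Split text into Unicode code points (character-level view)."""
    return list(text)


def byte_tokens(text: str) -> list[int]:
    """Split text into raw UTF-8 byte values (byte-level view)."""
    return list(text.encode("utf-8"))


# Assumed example pair: the word for "book" in each script.
tajik = "\u043a\u0438\u0442\u043e\u0431"  # китоб (Cyrillic)
farsi = "\u06a9\u062a\u0627\u0628"        # کتاب (Perso-Arabic)

for label, word in (("Tajik", tajik), ("Farsi", farsi)):
    print(label, len(char_tokens(word)), "chars,", len(byte_tokens(word)), "bytes")
    print("  bytes:", byte_tokens(word))
```

Both views use a small, fixed vocabulary and never produce an out-of-vocabulary symbol, which is one plausible reason such models suit a script-to-script mapping; a subword tokenizer would instead depend on a learned vocabulary.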
- arXiv
- ByT5
- Farsi
- G2P Transformer
- LSTM
- Masnavi-i Ma'navi
- mBART
- Mullosharaf Arabov Am
- Shahnameh
- Tajik
- Transformer
- mT5
AI-generated summary · Google Gemini · from 2 sources.