This paper introduces a new benchmark for machine transliteration between Tajik and Farsi, developing a unique parallel corpus from diverse sources. The study compares six model architectures, including rule-based systems, LSTMs, Transformers, and pre-trained multilingual models. Results show that byte-level and character-level models, particularly ByT5, significantly outperform subword-based models like mT5 for this language pair.
Impact: Highlights the effectiveness of byte-level and character-level models over subword tokenization for specific transliteration tasks.
Ranking rationale: This is a research paper presenting a new benchmark and a comparative study of machine learning models for a specific NLP task.
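To make the byte-level versus character-level distinction in the summary concrete, here is a minimal sketch of how one word looks to each kind of model. The example word pair (Tajik "китоб" and Farsi "کتاب", both meaning "book") and the helper names `char_tokens` and `byte_tokens` are illustrative assumptions and are not taken from the paper; the sketch uses plain UTF-8 encoding and does not reproduce any model's actual tokenizer.

```python
# Illustrative sketch only: the word pair and helper names are assumptions,
# not material from the paper or from any model's tokenizer.

def char_tokens(text: str) -> list[str]:
    """Character-level view: one token per Unicode code point."""
    return list(text)


def byte_tokens(text: str) -> list[int]:
    """Byte-level view: one token per UTF-8 byte (values 0-255)."""
    return list(text.encode("utf-8"))


tajik = "китоб"  # Cyrillic script
farsi = "کتاب"   # Perso-Arabic script

for label, word in (("Tajik", tajik), ("Farsi", farsi)):
    print(label, word, char_tokens(word), byte_tokens(word))
```

Both views need only a small, fixed symbol inventory, so no learned subword vocabulary has to cover the Cyrillic and Perso-Arabic scripts at once. The cost is longer sequences, since each letter in these two scripts takes two bytes in UTF-8.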
- arXiv
- ByT5
- Farsi
- G2P Transformer
- LSTM
- Masnavi-i Ma'navi
- mBART
- Mullosharaf Arabov Am
- Shahnameh
- Tajik
- Transformer
- mT5
AI-generated summary · Google Gemini · from 2 sources.