Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration
Researchers have developed Koshur Diacritizer, a byte-level sequence-to-sequence model that restores diacritic marks in Kashmiri text. Diacritics are commonly omitted in digital Kashmiri, and their absence hinders natural language processing applications. A new dataset of over 23,000 aligned sentence pairs has been released alongside the model and source code, both to establish a reproducible baseline for Kashmiri diacritic restoration and to aid research in other low-resource languages.
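
To make "byte-level" and "aligned sentence pairs" concrete, here is a minimal Python sketch of how a (source, target) training pair for diacritic restoration could be built. It is an illustration under stated assumptions, not the released code: the helper names (strip_diacritics, to_byte_ids, from_byte_ids, make_pair) are hypothetical, and treating every Unicode nonspacing mark (category Mn) as a removable diacritic is a simplification that may not match how the actual dataset was prepared.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop every Unicode nonspacing mark (category Mn).

    Assumption: all Mn characters count as diacritics. A real pipeline
    for Kashmiri would likely use a curated list of marks instead.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 bytes, so the vocabulary is just 0..255."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Decode byte ids back to text; invalid sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


def make_pair(diacritized: str) -> tuple[list[int], list[int]]:
    """Build one aligned pair: undiacritized bytes in, diacritized bytes out."""
    return to_byte_ids(strip_diacritics(diacritized)), to_byte_ids(diacritized)


# Three Arabic-script letters, each followed by a fatha (U+064E).
example = "\u0643\u064e\u062a\u064e\u0628\u064e"
source_ids, target_ids = make_pair(example)
```

Working on bytes rather than words or subwords means no language-specific tokenizer is needed, which is one reason a byte-level setup can suit low-resource languages. The cost is longer sequences: in this example the diacritized target is twice as long in bytes as the undiacritized source.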