PulseAugur

New model restores diacritics in Kashmiri text

Researchers have developed Koshur Diacritizer, a byte-level sequence-to-sequence model that restores diacritic marks in Kashmiri text. Diacritics are frequently omitted in digital Kashmiri, which creates ambiguity and hinders natural language processing applications. Alongside the model and its source code, the researchers have released a dataset of more than 23,000 aligned sentence pairs, both to establish a reproducible baseline for Kashmiri diacritic restoration and to aid research on other low-resource languages.
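To make the task concrete, here is a minimal Python sketch of what "omitted diacritics" and "byte-level" mean in practice. It is an illustration under stated assumptions, not the authors' pipeline: it assumes diacritics are Unicode combining marks (category Mn), which only approximates Kashmiri orthography, and the helper names `strip_diacritics` and `to_byte_ids` are hypothetical. The sample string is a generic Arabic-script sequence rather than a real Kashmiri word.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop Unicode combining marks (category Mn), keeping base letters.

    Assumption: this approximates how diacritics go missing in digital text.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def to_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 bytes of the text, the raw units a byte-level model reads."""
    return list(text.encode("utf-8"))


# Illustrative sequence: BEH (U+0628) + FATHA (U+064E, a combining mark) + REH (U+0631).
marked = "\u0628\u064E\u0631"
plain = strip_diacritics(marked)

# A restoration model would be trained to map the bytes of `plain`
# back to the bytes of `marked`.
print(to_byte_ids(plain))
print(to_byte_ids(marked))
```

Each Arabic-script code point in this range occupies two UTF-8 bytes, so a restored diacritic shows up as two extra bytes in the target sequence. Working on bytes means no language-specific vocabulary is needed.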

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI · TIER_1 · English (EN) · Haq Nawaz Malik, Nahfid Nissar, Faizan Iqbal

    Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration

    arXiv:2606.15883v1 · Announce Type: cross · Abstract: Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, frequently omits diacritic marks in digital text, creating ambiguity and challenging downstream NLP applications. We present Koshur Diacritizer, a ByT5-sm…