New synthetic OCR dataset Koshur Pixel released for Kashmiri language

By PulseAugur Editorial · [1 source] · 2026-06-22 10:42

Researchers have introduced Koshur Pixel, a synthetic OCR dataset for the Kashmiri language. It contains more than 613,000 image-text pairs generated with the SynthOCR-Gen framework from the KS-PRET-5M corpus. Koshur Pixel aims to address the scarcity of annotated data for Kashmiri, a low-resource language whose Perso-Arabic Nastaliq script poses additional challenges for OCR. The dataset spans multiple fonts, textual granularities, and augmentation strategies that simulate real-world document conditions, supporting the development of OCR systems and the digitization of Kashmiri textual heritage.

IMPACT Enables development of OCR for under-resourced languages, aiding digitization efforts.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Nahfid Nissar · 2026-06-22 10:42

Koshur Pixel: A Large-Scale Synthetic OCR Dataset for Kashmiri

Optical Character Recognition (OCR) for low-resource languages is often constrained by the lack of annotated training data and the complexity of script-specific rendering. Kashmiri, written primarily in the Perso-Arabic Nastaliq script, presents additional challenges due to conte…
