PulseAugur

New synthetic OCR dataset Koshur Pixel released for Kashmiri language

Researchers have introduced Koshur Pixel, a synthetic OCR dataset for the Kashmiri language. It contains over 613,000 image-text pairs, generated with the SynthOCR-Gen framework from the KS-PRET-5M corpus. Koshur Pixel aims to address the scarcity of annotated data for Kashmiri, a low-resource language whose Perso-Arabic Nastaliq script poses additional challenges for OCR. The dataset spans multiple fonts, textual granularities, and augmentation strategies that simulate real-world document conditions, supporting the development of OCR systems and the digitization of Kashmiri textual heritage.

IMPACT Enables development of OCR for under-resourced languages, aiding digitization efforts.

RANK_REASON The cluster contains an academic paper detailing a new dataset for a specific language.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. arXiv cs.CL TIER_1 English(EN) · Nahfid Nissar

    Koshur Pixel: A Large-Scale Synthetic OCR Dataset for Kashmiri

    Optical Character Recognition (OCR) for low-resource languages is often constrained by the lack of annotated training data and the complexity of script-specific rendering. Kashmiri, written primarily in the Perso-Arabic Nastaliq script, presents additional challenges due to conte…