Researchers have introduced Koshur Pixel, a novel synthetic OCR dataset designed for the Kashmiri language. This dataset contains over 613,000 image-text pairs, generated using the SynthOCR-Gen framework from the KS-PRET-5M corpus. Koshur Pixel aims to address the scarcity of annotated data for low-resource languages like Kashmiri, which presents unique challenges due to its Perso-Arabic Nastaliq script. The dataset includes various fonts, textual granularities, and augmentation strategies to simulate real-world document conditions, facilitating the development of OCR systems and the digitization of Kashmiri textual heritage.
AI IMPACT Enables development of OCR for under-resourced languages, aiding digitization efforts.
RANK_REASON The cluster contains an academic paper detailing a new dataset for a specific language.
AI-generated summary · Google Gemini · from 1 source.