New Sinhala OCR Dataset and Model Achieve State-of-the-Art Performance

By PulseAugur Editorial · [1 sources] · 2026-06-30 04:00

Researchers have developed a new dataset, sinhala-ocr-lk-acts-1010, to improve optical character recognition (OCR) for Sinhala, a language spoken by approximately 16 million people in Sri Lanka. The dataset comprises 1,010 page-level images with their transcriptions, sourced from Sri Lankan Legislative Acts spanning two decades. In experiments fine-tuning three deep learning models (DeepSeek-OCR V1, DeepSeek-OCR V2, and LightOnOCR-2-1B), LightOnOCR-2-1B performed best, reaching a character error rate of 1.05% and significantly outperforming other open-source and commercial OCR solutions.
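For readers unfamiliar with the metric, character error rate (CER) is conventionally the edit distance between the model output and the reference transcription, divided by the reference length. The sketch below illustrates that standard definition only; the summary does not say how the paper normalizes text or counts characters, so the function names and the choice to count Unicode code points (rather than grapheme clusters, which can differ for an abugida such as Sinhala) are assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted over code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one dropped character in a 15-character reference.
    print(f"{cer('legislative act', 'legislatve act'):.3f}")  # 0.067
```

Under this definition, a CER of 1.05% corresponds to roughly one character-level error per hundred reference characters.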

IMPACT Improves OCR capabilities for low-resource languages, potentially aiding historical document digitization and accessibility.

RANK_REASON Academic paper introducing a new dataset and OCR model evaluation.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Avisha Dilhara, Nevidu Jayatilleke · 2026-06-30 04:00

Cross-Temporal Sinhala OCR: Page-Level Adaptation and Diachronic Analysis

arXiv:2606.29378v1 Announce Type: new Abstract: Sinhala is a morphologically rich abugida spoken by roughly 16 million people in Sri Lanka, and to date, there are no publicly available real-world datasets for page-level Sinhala OCR. All previous studies for assessing Sinhala OCR …
