Researchers have introduced a new dataset, sinhala-ocr-lk-acts-1010, to improve Optical Character Recognition (OCR) for Sinhala, a language spoken by approximately 16 million people in Sri Lanka. The dataset comprises 1,010 page-level images with their transcriptions, sourced from Sri Lankan Legislative Acts spanning two decades. In experiments fine-tuning three deep learning models (DeepSeek-OCR V1, DeepSeek-OCR V2, and LightOnOCR-2-1B), LightOnOCR-2-1B performed best, with a character error rate of 1.05%, significantly outperforming other open-source and commercial OCR solutions.
AI IMPACT Improves OCR capabilities for low-resource languages, potentially aiding historical document digitization and accessibility.
RANK_REASON Academic paper introducing a new dataset and OCR model evaluation.
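For readers unfamiliar with the character error rate (CER) quoted in the summary, the sketch below shows one common way to compute it: the Levenshtein edit distance between a reference transcription and the OCR output, divided by the reference length. This is an illustration only. The summary does not say exactly how the paper computes CER, the function names `levenshtein` and `cer` are hypothetical, and the code counts Unicode code points, whereas an evaluation of Sinhala text might instead count grapheme clusters or normalize the text first.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumption: characters are Unicode code points, not grapheme clusters.
    """
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character in a four-character reference.
    print(cer("abcd", "abxd"))  # 0.25
```

Under this definition, a CER of 1.05% corresponds to roughly one character-level edit per 95 reference characters.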
- DeepSeek-OCR V1
- DeepSeek-OCR V2
- Google Document AI
- LightOnOCR-2-1B
- Nevidu Jayatilleke
- QLoRA
- Sinhala
- sinhala-ocr-lk-acts-1010
- Sri Lanka
- Surya-OCR
- Tesseract v5
AI-generated summary · Google Gemini · from 1 source.