PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction
Researchers have introduced PorTEXTO, a new benchmark for visual text extraction in European Portuguese (pt-PT). It addresses the scarcity of pt-PT resources in existing optical character recognition (OCR) benchmarks, which often focus on high-resource languages or historical texts. PorTEXTO is built with a pipeline that combines transcriptions from a large language model with review by native speakers, to ensure quality and relevance for contemporary applications. The study found that specialized multilingual data contributes more to pt-PT OCR performance than model size or resolution, highlighting the need for open pt-PT OCR resources.
AI IMPACT: This benchmark could improve AI model performance for European Portuguese text extraction, enabling better applications in regions where European Portuguese is spoken.