
TOOL · arXiv cs.CV · English (EN) · 17h

PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

Researchers have introduced PorTEXTO, a new benchmark for visual text extraction in European Portuguese (pt-PT). It addresses the scarcity of pt-PT resources in existing optical character recognition (OCR) benchmarks, which often focus on high-resource languages or historical texts. PorTEXTO is built with a pipeline that combines transcriptions from a large language model with review by native speakers, to ensure quality and relevance for contemporary applications. The study found that specialized multilingual data matters more for pt-PT OCR performance than model size or resolution, highlighting the need for open pt-PT OCR resources.

IMPACT This benchmark could help improve model performance on European Portuguese text extraction and support better applications for pt-PT speakers.

LVLM
optical character recognition
European Portuguese
PorTEXTO