PulseAugur
research · [2 sources]

New framework fuses linguistic knowledge for Vietnamese scene-text image captioning

Researchers have developed HSTFG, a framework for scene-text image captioning that fuses three information streams: visual features, OCR-detected text, and linguistic knowledge. The framework is tailored to Vietnamese, a tonal language where language-agnostic fusion approaches struggle because of diacritic ambiguity and OCR errors. A specialized model, PhonoSTFG, incorporates phonological reasoning, and a new dataset, ViTextCaps, containing over 15,000 images and 74,000 captions, has been created to support this research.
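To make the diacritic-ambiguity problem concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the paper: the helper name `strip_diacritics` and the word list are assumptions chosen for the example. It simulates an OCR system that drops tone marks and shows that six distinct Vietnamese words all collapse to the same string.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Simulate an OCR reading that loses every diacritic (hypothetical helper)."""
    # The letter đ/Đ has no canonical decomposition, so map it by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the base characters; combining marks have a nonzero class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Six distinct Vietnamese words that differ only in tone marks.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]

# Group the words by their diacritic-free form.
groups = defaultdict(list)
for word in words:
    groups[strip_diacritics(word)].append(word)

for plain, originals in groups.items():
    print(f"{plain!r} could be any of {originals}")
```

A captioning model that treats the OCR output as language-agnostic text cannot tell which of the six words was on the sign, which is the kind of ambiguity the summary attributes to tonal languages.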

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Introduces a specialized framework and dataset for Vietnamese scene-text captioning, potentially improving multimodal AI performance for tonal languages.

RANK_REASON The cluster describes a new academic paper detailing a novel framework and dataset for a specific NLP task.


COVERAGE [2]

  1. arXiv cs.CL TIER_1 · Nhi Ngoc-Yen Nguyen, Anh-Duc Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

    Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

    arXiv:2604.27712v1 (cross-list). Abstract: Scene-text image captioning requires fusing three information streams -- visual features, OCR-detected text, and linguistic knowledge -- to generate descriptions that faithfully integrate text visible in images. Existing fusion ap…

  2. arXiv cs.CV TIER_1 · Ngan Luu-Thuy Nguyen

    Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

    Scene-text image captioning requires fusing three information streams -- visual features, OCR-detected text, and linguistic knowledge -- to generate descriptions that faithfully integrate text visible in images. Existing fusion approaches treat text as language-agnostic, which fa…