New framework fuses linguistic knowledge for Vietnamese scene-text image captioning

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 2 sources

Researchers have developed HSTFG, a framework for scene-text image captioning that aims to improve how visual features, OCR-detected text, and linguistic knowledge are fused. The work targets Vietnamese, a tonal language where standard approaches struggle because of diacritic ambiguity and OCR errors. A specialized model, PhonoSTFG, incorporates phonological reasoning, and a new dataset, ViTextCaps, containing over 15,000 images and 74,000 captions, has been created to support this research.
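To make the diacritic-ambiguity problem concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the paper: the helper names `strip_diacritics` and `group_by_stripped_form` are hypothetical, and the code simply simulates an OCR system that drops tone and vowel marks, showing how several distinct Vietnamese words collapse into one string.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Simulate a lossy OCR read by removing all combining marks.

    Hypothetical helper for illustration; not part of HSTFG or PhonoSTFG.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The letters đ/Đ have no canonical decomposition, so map them by hand.
    return stripped.replace("đ", "d").replace("Đ", "D")


def group_by_stripped_form(words):
    """Group words that become identical once their diacritics are lost."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_diacritics(word)].append(word)
    return dict(groups)


# Six different words (ghost, mother, but, grave, code, rice seedling)
# that differ only in tone marks.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
groups = group_by_stripped_form(words)
print(groups)
```

All six words map to the same unmarked key, so a captioning model that receives only the stripped OCR token cannot tell which word appeared in the image without additional linguistic context.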


IMPACT Introduces a specialized framework and dataset for Vietnamese scene-text captioning, potentially improving multimodal AI performance for tonal languages.

RANK_REASON The cluster describes a new academic paper detailing a novel framework and dataset for a specific NLP task.


COVERAGE [2]

arXiv cs.CL TIER_1 · Nhi Ngoc-Yen Nguyen, Anh-Duc Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen · 2026-05-01 04:00

Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

arXiv:2604.27712v1. Scene-text image captioning requires fusing three information streams -- visual features, OCR-detected text, and linguistic knowledge -- to generate descriptions that faithfully integrate text visible in images. Existing fusion ap…

arXiv cs.CV TIER_1 · Ngan Luu-Thuy Nguyen · 2026-04-30 10:57

Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

Scene-text image captioning requires fusing three information streams -- visual features, OCR-detected text, and linguistic knowledge -- to generate descriptions that faithfully integrate text visible in images. Existing fusion approaches treat text as language-agnostic, which fa…
