New framework enables zero-shot captioning of Indonesian traditional clothing

By PulseAugur Editorial · [1 source] · 2026-06-12 04:00

Researchers have developed Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of traditional Indonesian clothing. The framework combines CLIP and BERT text encoders with an LSTM decoder and is trained on a dataset of 3,800 expert-annotated images drawn from all 38 Indonesian provinces. Evaluated under a province-level inductive zero-shot protocol, the model achieves a CLIPScore of 0.8536 on unseen provinces and outperforms existing baselines.
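As a rough illustration of what a province-level inductive zero-shot protocol implies, the sketch below holds out entire provinces so that no test province contributes any image to training. This is not the authors' code: the record fields (`image_id`, `province`), the helper `province_level_split`, the even 100-images-per-province toy data, and the choice of 8 held-out provinces are all assumptions made for the example.

```python
import random


def province_level_split(samples, num_unseen, seed=0):
    """Split samples so that held-out provinces never appear in training.

    `samples` is a list of dicts with a "province" key (hypothetical schema).
    Returns (train, test, unseen_provinces).
    """
    provinces = sorted({s["province"] for s in samples})
    rng = random.Random(seed)
    # Choose whole provinces to hold out, not individual images.
    unseen = set(rng.sample(provinces, num_unseen))
    train = [s for s in samples if s["province"] not in unseen]
    test = [s for s in samples if s["province"] in unseen]
    return train, test, unseen


# Toy data: 38 placeholder provinces x 100 images = 3,800 records.
# The even per-province count is an assumption, not a reported statistic.
samples = [
    {"image_id": f"p{p:02d}_{i:03d}", "province": f"province_{p:02d}"}
    for p in range(38)
    for i in range(100)
]

# num_unseen=8 is an arbitrary illustrative choice.
train, test, unseen = province_level_split(samples, num_unseen=8)
train_provinces = {s["province"] for s in train}
test_provinces = {s["province"] for s in test}
```

Splitting by province rather than by image is what makes the setting inductive: a random image-level split would let the model see examples from every province during training, so scores on the test set would no longer measure generalization to unseen regions.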

IMPACT This research advances zero-shot learning capabilities for specialized cultural heritage datasets, potentially improving AI's ability to understand and describe diverse cultural artifacts.

RANK_REASON The cluster describes a research paper published on arXiv detailing a new framework for image analysis.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Anugrah Aidin Yotolembah, Novanto Yudistira, Gembong Edhi Setyawan · 2026-06-12 04:00

Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing

arXiv:2606.13275v1 Announce Type: new Abstract: This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. U…
