Researchers have developed Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of traditional Indonesian clothing. The system combines CLIP and BERT text encoders with an LSTM caption decoder; it was trained on data from 24 Indonesian provinces and evaluated on 8 unseen provinces. It achieved a CLIPScore of 0.8536, a BLEU-4 of 0.3342, and a METEOR of 0.4859, with reported improvements in cultural vocabulary recovery and overall accuracy, particularly in low-resource heritage contexts.
AI IMPACT Advances zero-shot captioning for cultural heritage data, potentially improving accessibility and analysis of specialized visual datasets.
RANK_REASON The cluster describes a research paper published on arXiv detailing a new framework for image analysis and captioning.
AI-generated summary · Google Gemini · from 2 sources.