Adapting Vision-Language Models from Iconic to Inclusive for Multi-Label Recognition Without Labels
Researchers have developed an unsupervised framework that adapts vision-language models (VLMs) for more comprehensive multi-label image recognition. The method addresses the tendency of VLMs to focus on a single iconic object and miss other relevant labels in an image. Using a "cutting" stage and a "sewing" stage, the framework improves the model's ability to identify multiple objects and adjust label distributions without requiring manual annotations. Experiments show that this approach significantly outperforms existing unsupervised methods and even some weakly supervised baselines.
AI IMPACT: Enables more comprehensive image understanding without manual labeling, potentially improving applications in image search and content moderation.
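
To make the "iconic" bias concrete, here is a minimal toy sketch in Python. It is not the framework's "cutting" or "sewing" stages, which this summary does not specify. It only illustrates, under assumed conditions, why picking the single best-matching label can hide other objects that are present. The embeddings, the temperature of 0.01, and the 0.5 threshold are all hypothetical values chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.01):
    """Softmax over label scores; a low temperature sharpens the distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy embeddings; a real VLM would produce these with its encoders.
label_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "frisbee": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
# Assumed image: the dog dominates, but a frisbee is also clearly present.
image_embedding = [0.8, 0.6, 0.0]

labels = list(label_embeddings)
sims = [cosine(image_embedding, label_embeddings[name]) for name in labels]

# Single-label view: labels compete, so nearly all mass goes to the iconic object.
probs = softmax(sims)
top1 = labels[probs.index(max(probs))]

# Multi-label view: score each label independently against a threshold (assumed 0.5).
threshold = 0.5
multi = [name for name, s in zip(labels, sims) if s >= threshold]

print("top-1 label:", top1)
print("multi-label prediction:", multi)
```

In this toy setup the competitive softmax assigns almost all probability to one label, while independent per-label scoring keeps both objects. Choosing good per-label scores and thresholds without annotations is the hard part that an unsupervised adaptation method has to solve.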