Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval
Researchers have developed a framework for fashion image retrieval that combines multi-modal large language models (LLMs) with a two-stage fine-tuning strategy. The approach uses models such as LLaVA to generate attribute-aware triplets and builds on pretrained vision-language models such as CLIP ViT-B/32 for enhanced contrastive learning. It aims to improve compositional reasoning and fine-grained retrieval by addressing limitations of existing approaches, such as scarce annotated data and simplistic negative sampling.
AI IMPACT: This research could lead to more sophisticated image search and recommendation systems in the fashion industry.
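To make the idea of triplet-based contrastive learning more concrete, here is a minimal sketch of a cosine-similarity triplet margin loss. It is not the authors' implementation: the summary does not specify the loss, the margin, or the training setup, so the function names, the margin of 0.2, and the toy 2-D vectors (standing in for CLIP embeddings) are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet margin loss on cosine similarity (illustrative only).

    Each argument is an array of shape (batch, dim). The loss is zero when the
    anchor is closer to the positive than to the negative by at least `margin`.
    The margin value is a hypothetical choice, not taken from the source.
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    sim_pos = np.sum(a * p, axis=-1)  # cosine similarity anchor vs. positive
    sim_neg = np.sum(a * n, axis=-1)  # cosine similarity anchor vs. negative
    return float(np.maximum(0.0, margin - sim_pos + sim_neg).mean())


# Toy vectors standing in for image/text embeddings (hypothetical data).
anchor = np.array([[1.0, 0.0]])
positive = np.array([[1.0, 0.0]])
easy_negative = np.array([[0.0, 1.0]])   # unrelated item
hard_negative = np.array([[1.0, 0.1]])   # differs only slightly, like one attribute

easy_loss = triplet_margin_loss(anchor, positive, easy_negative)
hard_loss = triplet_margin_loss(anchor, positive, hard_negative)
print(easy_loss, hard_loss)
```

In this sketch the easy negative contributes no loss, while the near-identical negative does. That is the intuition behind replacing simplistic negative sampling with attribute-aware negatives: only negatives that resemble the anchor give the model a useful training signal.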