New framework uses LLMs for enhanced fashion image retrieval

By PulseAugur Editorial · [1 source] · 2026-06-19 04:00

Researchers have proposed a framework for composed fashion image retrieval that combines multi-modal large language models (LLMs) with a two-stage fine-tuning strategy. It uses models such as LLaVA to generate attribute-aware triplets, then fine-tunes pretrained vision-language models such as CLIP ViT-B/32 with contrastive learning on those triplets. The method targets two limitations of existing approaches, scarce annotated data and simplistic negative sampling, with the aim of improving compositional reasoning and fine-grained retrieval.
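To make the triplet idea concrete, here is a minimal sketch of how a contrastive objective over (query, target, hard negative) triplets might look. It is not the authors' implementation: the summary does not state which loss or fusion method the paper uses, so the margin-based loss, the additive fusion in `compose_query`, the margin value, and the toy vectors (standing in for CLIP embeddings) are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compose_query(image_emb, text_emb):
    """Fuse a reference-image embedding with a text embedding.

    Element-wise addition is a hypothetical stand-in; the paper's
    actual fusion method is not described in the summary.
    """
    return [a + b for a, b in zip(image_emb, text_emb)]


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pushes the anchor closer to the positive
    than to the negative by at least `margin` in cosine similarity."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


# Toy 2-D vectors standing in for real embeddings (assumed values).
reference_image = [1.0, 0.0]   # e.g. a red striped shirt
modification_text = [0.0, 1.0]  # e.g. "make it blue"
query = compose_query(reference_image, modification_text)

target = [1.0, 1.0]         # matches the composed query
hard_negative = [1.0, 0.0]  # differs from the target in one attribute

loss = triplet_margin_loss(query, target, hard_negative)
print(f"triplet loss: {loss:.4f}")
```

The intuition is that an attribute-aware negative (same garment, one attribute changed) sits closer to the query than a random image would, so it yields a more informative training signal than simple random negative sampling.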

IMPACT This research could lead to more sophisticated image search and recommendation systems in the fashion industry.

RANK_REASON The cluster contains an academic paper detailing a new technical approach.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Nguyen Cao Hoang, Hoang Bui Le, Nam Vo Hoang, Trung-Nghia Le · 2026-06-19 04:00

Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

arXiv:2606.19684v1. Abstract: Composed image retrieval retrieves a target image using a composed query of a reference image and a modified text description. In the fashion domain, this task requires understanding subtle attribute variations such as color, patter…
