FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction
Researchers have developed FineGen, a novel framework that uses vision-language models (VLMs) and a multi-agent system to automatically construct image-text datasets. This system employs a collaborative pipeline for generating, verifying, and correcting data, specifically focusing on creating hard negative samples that are semantically relevant but visually contradictory. The framework has been used to create FineGen-100K, a dataset with over 147,000 hard negatives, which significantly improved accuracy on downstream tasks by 14.4% when used for fine-tuning. AI
IMPACT Enhances fine-grained perception capabilities by providing specialized datasets for training vision-language models.