FineVision: Open Data Is All You Need
Researchers have introduced FineVision, a dataset of 24 million samples for training vision-language models (VLMs). The corpus unifies over 200 sources through a semi-automated, human-in-the-loop pipeline that handles data hygiene, de-duplication, and safety checks. Models trained on FineVision outperform those trained on existing open datasets, which points to the importance of scale and careful data curation in VLM development. The dataset and its curation tools are being released to support further research on data-centric approaches to VLMs.
AI IMPACT: Provides a large, clean dataset to accelerate research and development in vision-language models.