Image Generators are Generalist Vision Learners
Researchers have shown that image generation models can serve as strong generalist learners for computer vision tasks. By instruction-tuning Nano Banana Pro on a mix of its original training data and vision task data, they created Vision Banana. The model achieved state-of-the-art results on segmentation and depth estimation, outperforming specialized models. The findings suggest that training for image generation inherently builds strong visual understanding, which could shift computer vision toward generative pretraining for foundation models.
AI IMPACT: Generative pretraining may become central to developing foundation vision models, unifying generation and understanding tasks.