Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP
Researchers have reevaluated the hypothesis that CLIP-like models produce suboptimal image embeddings for image-only tasks because their training objective prioritizes language-image alignment over image-image alignment. Their findings suggest that the observed performance differences stem from task ambiguity rather than from intra-modal misalignment. In their experiments, models trained with language-image objectives and models trained solely on images yield similar results on intra-modal tasks, which challenges the original hypothesis.
IMPACT: Challenges a common assumption about the limitations of contrastive language-image pre-training, potentially influencing future model development and evaluation strategies.