Researchers have investigated the modality gap in multi-modal models such as CLIP, observing that image and text embeddings often occupy separate regions of the shared embedding space. This paper argues that the gap is a feature rather than a bug: it can be exploited to improve robustness. A simple post-processing step that reduces the gap significantly increases the models' robustness to perturbations without sacrificing clean accuracy.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Suggests a method to improve the robustness of existing multi-modal models without degrading clean accuracy.
RANK_REASON Academic paper published on arXiv detailing findings about multi-modal model robustness.
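The summary does not specify the post-processing technique, so the following is only a hedged sketch of one common gap-reduction baseline: shifting each modality's embeddings along the direction between the two modality centroids, then renormalizing. The `close_gap` function and the toy embeddings are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for L2-normalized image/text embeddings (unit rows).
# The +0.5 offset on the text side simulates a modality gap.
img = rng.normal(size=(512, 64))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(512, 64)) + 0.5
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

def close_gap(img_emb, txt_emb):
    """Shift each modality by half the centroid difference, then renormalize.

    This is a generic centroid-alignment baseline (an assumption here),
    not necessarily the paper's post-processing technique.
    """
    gap = img_emb.mean(axis=0) - txt_emb.mean(axis=0)  # direction between centroids
    img_c = img_emb - gap / 2
    txt_c = txt_emb + gap / 2
    img_c /= np.linalg.norm(img_c, axis=1, keepdims=True)
    txt_c /= np.linalg.norm(txt_c, axis=1, keepdims=True)
    return img_c, txt_c

img_c, txt_c = close_gap(img, txt)
before = np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))
after = np.linalg.norm(img_c.mean(axis=0) - txt_c.mean(axis=0))
print(before, after)  # centroid distance shrinks after the shift
```

Because the shift equalizes the two centroids before renormalization, the residual gap after renormalizing is small; the appeal of such post-processing is that it needs no retraining of the underlying model.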