CLIMP: Mamba-based vision-language model surpasses OpenAI's CLIP

By PulseAugur Editorial · [1 source] · 2026-06-30 04:00

Researchers have introduced CLIMP, a contrastive language-image pre-training model built entirely on the Mamba architecture rather than on Vision Transformers. The design targets two limitations of Vision Transformers: attention cost that scales quadratically with image resolution, and susceptibility to spurious correlations. According to the paper, CLIMP outperforms OpenAI's CLIP-ViT-B in cross-modal retrieval and out-of-distribution robustness while using less memory and fewer FLOPs. Its autoregressive text encoder also enables dense captioning retrieval.
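To make the "quadratic scaling with resolution" point concrete, here is a minimal back-of-the-envelope sketch. It assumes a square image split into 16x16 patches (a common ViT-B setting, not a figure taken from the CLIMP paper) and counts only token-pair comparisons for self-attention versus sequential steps for a state-space scan. The function names are hypothetical, and constant factors and hidden dimensions are ignored, so the output shows growth ratios, not measured FLOPs.

```python
def num_patches(resolution: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size
    return side * side


def attention_pair_count(num_tokens: int) -> int:
    """Self-attention compares every token with every token: N * N pairs."""
    return num_tokens * num_tokens


def scan_step_count(num_tokens: int) -> int:
    """A linear-time scan visits each token once: N steps."""
    return num_tokens


def growth_table(resolutions=(224, 448, 896), patch_size=16):
    """Rows of (resolution, tokens, attention growth, scan growth),
    with growth measured relative to the first resolution."""
    base = num_patches(resolutions[0], patch_size)
    rows = []
    for res in resolutions:
        n = num_patches(res, patch_size)
        rows.append((
            res,
            n,
            attention_pair_count(n) // attention_pair_count(base),
            scan_step_count(n) // scan_step_count(base),
        ))
    return rows


if __name__ == "__main__":
    for res, tokens, attn_x, scan_x in growth_table():
        print(f"{res}px: {tokens} tokens, attention x{attn_x}, scan x{scan_x}")
```

Under these assumptions, doubling the side length quadruples the token count, so the pairwise attention term grows 16-fold while the linear scan grows 4-fold.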

IMPACT This research suggests Mamba architectures are a viable and efficient alternative to Transformers for vision-language tasks, potentially influencing future model development.

RANK_REASON Research paper introducing a new model architecture and benchmark results.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Nimrod Shabtay, Itamar Zimerman, Eli Schwartz, Raja Giryes · 2026-06-30 04:00

CLIMP: Contrastive Language-Image Mamba Pretraining

arXiv:2601.06891v2 Announce Type: replace Abstract: Contrastive Language-Image Pre-training (CLIP) relies on Vision Transformers whose attention mechanism is susceptible to spurious correlations and scales quadratically with resolution. To address these limitations, we present C…
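For readers unfamiliar with the contrastive objective named in the abstract, the following is a minimal sketch of the symmetric loss used in CLIP-style training, written in plain Python. It is an illustration only: the function names are hypothetical, the temperature of 0.07 is an assumed value, and nothing here is taken from the CLIMP implementation, whose exact loss and encoders are not described in the truncated abstract above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same item.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Similarity of every image to every text, sharpened by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs
    print(contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs
```

The loss is low when each image embedding is closest to its own caption and high when the pairing is swapped, which is the signal that pulls matching image and text representations together during pre-training.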
