Brief · PulseAugur

RESEARCH · arXiv cs.LG English(EN) · 5d · [3 sources]

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

Researchers have developed Conan-embedding-v3, a new framework designed to create a unified embedding space for multiple data modalities including text, images, video, documents, and audio. The approach involves training modality-specific models independently, then fusing their task vectors into a single backbone. A key challenge addressed is "Projector Drift," which occurs when fusing models with external encoders, leading to performance degradation in specific modalities like audio. Conan-embedding-v3 employs "Projector Recovery" and multi-modal rehearsal to mitigate this issue, achieving strong performance on benchmarks like MMEB and MAEB. AI

IMPACT Introduces a novel framework for unifying diverse data types into a single embedding space, potentially improving cross-modal retrieval and understanding.

Conan-embedding-v3