OmniRetriever-7B advances audio-video-text retrieval with fusion distillation

By PulseAugur Editorial · [3 sources] · 2026-05-26 07:26

Researchers have introduced OmniRetriever-7B, a model for any-to-any retrieval across audio, video, and text. It uses a fusion-as-teacher distillation technique to improve joint representation learning. In evaluations across six benchmarks, OmniRetriever-7B outperformed Gemini Embedding 2, particularly on zero-shot retrieval tasks.
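To make "any-to-any retrieval" concrete, the following is a minimal sketch of how ranking in a shared embedding space typically works. It is not OmniRetriever's implementation: the function names (`l2_normalize`, `cosine_rank`), the three-dimensional toy vectors, and the item labels are all hypothetical, and real embeddings would come from the model's encoders rather than being written by hand. The sketch assumes only that items from every modality are embedded into the same vector space and compared by cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_rank(query, candidates, top_k=3):
    """Return the top_k (item_id, score) pairs ranked by cosine similarity.

    `candidates` maps an item id to its embedding. Items may come from any
    modality, as long as they share one embedding space with the query.
    """
    q = l2_normalize(query)
    scored = []
    for item_id, vec in candidates.items():
        c = l2_normalize(vec)
        score = sum(a * b for a, b in zip(q, c))
        scored.append((score, item_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(item_id, score) for score, item_id in scored[:top_k]]


# Hypothetical toy embeddings standing in for encoder outputs.
index = {
    "video:dog_barking": [0.9, 0.1, 0.0],
    "audio:rainfall": [0.0, 0.2, 0.9],
    "text:a dog barks loudly": [0.8, 0.3, 0.1],
}

# A hypothetical text-query embedding; the same call would work for an
# audio or video query, which is what makes the retrieval any-to-any.
text_query = [1.0, 0.2, 0.0]
results = cosine_rank(text_query, index, top_k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

Because every modality lives in one space, the index can mix video, audio, and text items, and the query can be any of the three without changing the ranking code.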

IMPACT Improves cross-modal retrieval, which could benefit multimodal RAG systems and search.

RANK_REASON The cluster describes a new research paper detailing a novel model and benchmark for multimodal retrieval.

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-26 07:26

OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation

Unified multimodal embedding spaces have become the standard interface for cross-modal retrieval and multimodal RAG, and recent audio-video-text (AVT) encoders extend this setting to three modalities. Such encoders can produce a joint (T,V,A) embedding whenever all three modaliti…

arXiv cs.CV TIER_1 English(EN) · Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen · 2026-05-27 04:00

OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation

arXiv:2605.26641v1. Abstract: Unified multimodal embedding spaces have become the standard interface for cross-modal retrieval and multimodal RAG, and recent audio-video-text (AVT) encoders extend this setting to three modalities. Such encoders can produce a joi…

arXiv cs.CV TIER_1 English(EN) · Junxiao Shen · 2026-05-26 07:26

OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation

Unified multimodal embedding spaces have become the standard interface for cross-modal retrieval and multimodal RAG, and recent audio-video-text (AVT) encoders extend this setting to three modalities. Such encoders can produce a joint (T,V,A) embedding whenever all three modaliti…

