Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
Researchers have developed Omni-Embed-Audio (OEA), a new retrieval-oriented encoder that uses multimodal large language models to improve audio-text retrieval. Unlike previous systems that relied on caption-style queries, OEA is designed to handle more natural search behaviors, including questions, commands, and negative queries. Experiments show that OEA performs comparably to existing state-of-the-art models in text-to-audio retrieval while significantly outperforming them in text-to-text retrieval and in distinguishing between similar-sounding audio clips.
AI IMPACT: Introduces a more robust method for audio-text retrieval, potentially improving search capabilities in multimodal AI applications.