A new survey paper provides a comprehensive review of Audio-Visual Intelligence (AVI) within the context of large foundation models. It establishes a unified taxonomy for AVI tasks, covering understanding, generation, and interaction across audio and visual modalities. The paper synthesizes methodological foundations, datasets, benchmarks, and evaluation metrics, aiming to create a coherent framework for this rapidly evolving field.
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Consolidates research in audio-visual intelligence, potentially accelerating development of multimodal AI systems.
RANK_REASON This is a survey paper on a research topic, not a model release or significant industry event.