AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs
Researchers have introduced AVI-Bench, a new benchmark designed to evaluate the audio-visual intelligence of Omni-Multimodal Large Language Models (Omni-MLLMs). This benchmark assesses models across perception, understanding, and reasoning stages using tasks that require joint audio-visual interpretation. An extension, AVI-Bench-PriSe, further tests robustness with unfamiliar stimuli to gauge generalization beyond typical training data. Experiments indicate current Omni-MLLMs have significant limitations in audio-visual intelligence.
IMPACT: Provides a new framework for evaluating and improving the audio-visual capabilities of multimodal AI models.