From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
Researchers investigated the internal information flow of multimodal large language models (MLLMs) that process both audio and visual data. The study, which focuses on Audio-Visual Large Language Models (AVLLMs), examines how these models route and integrate sensory inputs to generate responses. The findings indicate that information follows sequential pathways for video-based inputs and shifts to parallel streams for interleaved audio-visual items, with redundant information discarded to improve efficiency.
AI IMPACT: Provides insight into the internal workings of AVLLMs, potentially guiding future interpretability and efficiency improvements.