Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis
Researchers have developed a novel extension of Shapley Values to explain the behavior of multimodal multilingual models (MLLMs). The framework addresses the challenge of integrating text and audio data by treating inputs from both modalities as cooperative features and using efficient estimation strategies to keep computation feasible. It introduces a new preprocessing method, Spectrogram-Guided Phonetic Alignment (SGPA), to align audio segments with text, and ships as an open-source package with a GUI for visualization. Experiments on datasets such as VoiceBench and Infinity Instruct show that input modality significantly affects attributions and that standard importance proxies are insufficient in multimodal, cross-lingual contexts.
AI IMPACT: Provides a new method for understanding and potentially debugging complex multimodal AI systems.
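To make the idea of treating text and audio inputs as cooperative features concrete, here is a minimal sketch of Shapley value estimation by permutation sampling, a standard Monte Carlo approach. It is not the researchers' implementation and does not use their package; the feature names, the `toy_value` function, and the `shapley_permutation_estimate` helper are all hypothetical and stand in for a real model's output on masked inputs.

```python
import random
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_permutation_estimate(
    features: Sequence[str],
    value_fn: Callable[[FrozenSet[str]], float],
    num_permutations: int = 500,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled feature orderings."""
    rng = random.Random(seed)
    totals = {f: 0.0 for f in features}
    order = list(features)
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for f in order:
            coalition.add(f)
            cur = value_fn(frozenset(coalition))
            totals[f] += cur - prev  # marginal contribution of f in this ordering
            prev = cur
    return {f: t / num_permutations for f, t in totals.items()}


# Hypothetical mixed-modality feature set: two text tokens, two audio segments.
features = ["text:hello", "text:world", "audio:seg0", "audio:seg1"]


def toy_value(coalition: FrozenSet[str]) -> float:
    """Toy stand-in for a model score when only `coalition` is unmasked.
    Two features act additively; one text token and one audio segment
    only contribute when both are present (a cross-modal interaction)."""
    score = 0.0
    if "text:hello" in coalition:
        score += 1.0
    if "audio:seg0" in coalition:
        score += 0.5
    if "text:world" in coalition and "audio:seg1" in coalition:
        score += 2.0
    return score


phi = shapley_permutation_estimate(features, toy_value)
for name, value in phi.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup the attributions always sum to the score of the full input minus the score of the empty input, and the credit for the cross-modal interaction is shared between the text token and the audio segment that jointly produce it. In a real setting, `value_fn` would call the model on inputs with the excluded tokens and audio segments masked, which is where the cost of exact computation comes from and why sampling-based estimates are used.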