Assessing Factual Music Comprehension in Large Audio Language Models
Researchers have developed a new protocol to accurately assess the factual music comprehension of large audio language models (LALMs). The existing MusicQA dataset was found to be insufficient for measuring the factual correctness of LALM responses. The new protocol prompts LALMs for verifiable information and parses their open-ended answers into a structured format for objective evaluation using precision, recall, and F1 scores. This protocol was used to benchmark nine LALMs, including Gemini and Music Flamingo, across six factual information retrieval tasks on three datasets.
AI IMPACT: Establishes a more rigorous method for evaluating audio LLMs, potentially driving improvements in their factual accuracy for music-related queries.
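To illustrate the scoring step described in the summary, here is a minimal sketch of how precision, recall, and F1 could be computed once an open-ended answer has been parsed into a list of items. The summary does not say how the protocol matches items, so exact matching on normalized strings is an assumption, and the names `normalize`, `score_answer`, and `Scores`, as well as the instrument example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f1: float


def normalize(item: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(item.lower().split())


def score_answer(predicted: list[str], reference: list[str]) -> Scores:
    """Score one parsed answer against its reference annotation.

    Assumption: items are compared by exact match after normalization.
    The actual protocol may use a different matching rule.
    """
    pred = {normalize(p) for p in predicted if p.strip()}
    ref = {normalize(r) for r in reference if r.strip()}

    true_positives = len(pred & ref)
    # Precision: share of the model's items that are correct.
    precision = true_positives / len(pred) if pred else 0.0
    # Recall: share of the reference items that the model found.
    recall = true_positives / len(ref) if ref else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Hypothetical example: instruments parsed from a model answer vs. ground truth.
example = score_answer(
    predicted=["Piano", "violin", "drums"],
    reference=["piano", "violin", "cello", "flute"],
)
print(example)
```

In this hypothetical case two of the three predicted items are correct and two of the four reference items are recovered, so precision is 2/3, recall is 1/2, and F1 is 4/7. Set-based scoring of this kind penalizes both invented items and omissions, which is what makes the evaluation objective compared with judging free-form text directly.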