VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
Researchers have developed VocalParse, a new model for transcribing singing voices that utilizes a Large Audio Language Model (LALM). This model addresses limitations in current systems by jointly modeling lyrics, melody, and text-note alignments through an interleaved prompting formulation. VocalParse also employs a Chain-of-Thought strategy to first decode lyrics, which helps maintain structural integrity and improve transcription accuracy, achieving state-of-the-art results on various singing datasets. AI
IMPACT Advances singing voice transcription accuracy and scalability, potentially improving tools for music production and analysis.