Brief · PulseAugur

TOOL · Hugging Face Daily Papers English(EN) · 1mo

VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

Researchers have developed VocalParse, a new model for transcribing singing voices that utilizes a Large Audio Language Model (LALM). This model addresses limitations in current systems by jointly modeling lyrics, melody, and text-note alignments through an interleaved prompting formulation. VocalParse also employs a Chain-of-Thought strategy to first decode lyrics, which helps maintain structural integrity and improve transcription accuracy, achieving state-of-the-art results on various singing datasets. AI

IMPACT Advances singing voice transcription accuracy and scalability, potentially improving tools for music production and analysis.

VocalParse
Large Audio Language Model
Singing Voice Synthesis