ALAS: An Automatic Latent Alignment Score for Audio Language Models
Researchers have developed ALAS (Automatic Latent Alignment Score), a metric that evaluates how well audio language models align audio frames with text tokens. The metric is model- and task-agnostic: it analyzes an LLM's hidden states, comparing audio and text representations against a reference derived from Whisper. ALAS requires only a frozen forward pass and an off-the-shelf ASR reference, with no training and no fitted classifier. Applied to four open-source Speech-LLMs, ALAS showed that alignment depth reflects audio-encoder design and task demands, and that the score can identify models that perform well without genuine audio grounding.
AI IMPACT: Introduces a new metric for evaluating audio-text alignment in Speech-LLMs, aiding the development of more robust spoken language understanding systems.
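To make the idea of scoring hidden states against an ASR-derived reference more concrete, here is a minimal illustrative sketch. It is not the ALAS formula, which this summary does not give. It assumes a simple contrast: the mean cosine similarity between audio-frame and text-token hidden states on pairs the reference marks as aligned, minus the mean on all other pairs, computed per layer. The function names, the boolean reference mask, and the synthetic data are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(audio_h, text_h):
    """Cosine similarity between every audio frame and every text token."""
    a = audio_h / np.linalg.norm(audio_h, axis=1, keepdims=True)
    t = text_h / np.linalg.norm(text_h, axis=1, keepdims=True)
    return a @ t.T  # shape: (num_frames, num_tokens)


def alignment_contrast(audio_h, text_h, ref_mask):
    """Assumed score (not the ALAS definition): mean similarity on
    reference-aligned pairs minus mean similarity on all other pairs."""
    sim = cosine_similarity_matrix(audio_h, text_h)
    return float(sim[ref_mask].mean() - sim[~ref_mask].mean())


def score_per_layer(audio_layers, text_layers, ref_mask):
    """Apply the contrast score to each layer's hidden states."""
    return [
        alignment_contrast(a, t, ref_mask)
        for a, t in zip(audio_layers, text_layers)
    ]


# Synthetic stand-ins for hidden states taken from a frozen forward pass.
rng = np.random.default_rng(0)
num_tokens, frames_per_token, dim = 4, 3, 64

text_h = rng.normal(size=(num_tokens, dim))

# Hypothetical reference: frame i belongs to token i // frames_per_token,
# as might be derived from ASR timestamps.
ref_mask = np.repeat(np.eye(num_tokens, dtype=bool), frames_per_token, axis=0)

# "Layer 0": audio states unrelated to the text states.
random_audio = rng.normal(size=(num_tokens * frames_per_token, dim))

# "Layer 1": audio states that are noisy copies of their aligned token states.
aligned_audio = np.repeat(text_h, frames_per_token, axis=0)
aligned_audio = aligned_audio + 0.1 * rng.normal(size=aligned_audio.shape)

scores = score_per_layer(
    [random_audio, aligned_audio], [text_h, text_h], ref_mask
)
print(scores)
```

In this toy setup the second "layer" scores much higher than the first, which is the kind of layer-wise contrast a latent alignment profile is meant to expose. The sketch needs no training and no fitted classifier, only hidden states and a reference mask.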