ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

New benchmark shows LALM judges lag behind humans in paralinguistic evaluation

By the PulseAugur editorial team · [1 source] · 2026-06-23 14:43

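Researchers have developed ParaPairAudioBench, a new benchmark designed to evaluate how well large audio-language models (LALMs) distinguish fine-grained paralinguistic features in speech. The benchmark contains 5,175 audio pairs spanning five dimensions: style, speaking rate, emphasis, age, and gender. Current LALM judges perform markedly worse than human evaluators, trailing by 32 percentage points on average, and they struggle with calibration, particularly in correctly judging when they should abstain.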
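As a rough illustration of the two quantities in the summary above (a judge's agreement with reference labels, and how well it decides when to abstain), the following is a minimal sketch. It is not the paper's scoring code: the label set ("A", "B", "abstain"), the `PairJudgment` type, the function names, and the toy data are all hypothetical, and it assumes that some pairs carry a reference label of "abstain".

```python
from dataclasses import dataclass

# Hypothetical label set: the judge prefers clip "A", clip "B", or abstains.
ABSTAIN = "abstain"


@dataclass
class PairJudgment:
    gold: str       # reference label (assumed): "A", "B", or "abstain"
    predicted: str  # judge's label: "A", "B", or "abstain"


def pairwise_accuracy(judgments):
    """Fraction of pairs where the judge's label matches the reference."""
    if not judgments:
        raise ValueError("no judgments to score")
    correct = sum(j.gold == j.predicted for j in judgments)
    return correct / len(judgments)


def abstention_precision_recall(judgments):
    """Precision and recall of the judge's abstain decisions."""
    predicted_abstain = [j for j in judgments if j.predicted == ABSTAIN]
    gold_abstain = [j for j in judgments if j.gold == ABSTAIN]
    true_abstain = sum(j.gold == ABSTAIN for j in predicted_abstain)
    precision = true_abstain / len(predicted_abstain) if predicted_abstain else 0.0
    recall = true_abstain / len(gold_abstain) if gold_abstain else 0.0
    return precision, recall


def gap_in_points(human_accuracy, judge_accuracy):
    """Difference between two accuracies, expressed in percentage points."""
    return (human_accuracy - judge_accuracy) * 100


# Toy data for illustration only; these are not benchmark results.
toy = [
    PairJudgment(gold="A", predicted="A"),
    PairJudgment(gold="B", predicted="A"),
    PairJudgment(gold=ABSTAIN, predicted="A"),
    PairJudgment(gold=ABSTAIN, predicted=ABSTAIN),
    PairJudgment(gold="B", predicted="B"),
]

if __name__ == "__main__":
    print("accuracy:", pairwise_accuracy(toy))
    print("abstain precision/recall:", abstention_precision_recall(toy))
```

Reporting abstention precision and recall separately from overall accuracy is one way to surface a judge that rarely abstains when it should; the paper may define calibration differently.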

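Impact: The benchmark highlights the limitations of current LALMs in fine-grained speech evaluation and may steer future research toward audio evaluation that aligns more closely with human judgment.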

Ranking rationale: This cluster describes a new academic paper that introduces a benchmark for evaluating AI models.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

Hugging Face Daily Papers · English · 2026-06-23 14:43

ParaPairAudioBench: Paralinguistic Pairwise Audio Benchmark for LALM-as-a-Judge

Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained paralinguistic distinctions underexplored. We introduce ParaPair…

