PulseAugur
VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

VLM judges show task-dependent uncertainty in multimodal evaluation, which undermines the reliability of their scores.

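A new paper applies conformal prediction to assess how reliable vision-language models (VLMs) are when they act as automated judges of multimodal systems. The study finds that uncertainty in VLM evaluation depends heavily on the task: mathematical reasoning tasks yield markedly wider, less informative prediction intervals than image aesthetics. The work also identifies a key problem it calls "ranking-scoring decoupling": a VLM can rank responses accurately yet fail to provide reliable absolute scores, which points to the need for more robust evaluation methods.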
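The gap between ranking and scoring can be illustrated with a small, self-contained sketch. The scores below are made up for illustration and do not come from the paper; `ranks`, `spearman`, and `mean_abs_error` are hypothetical helper names, and the Spearman formula used here assumes no tied scores. The paper's own metrics may differ.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def mean_abs_error(x, y):
    """Average absolute gap between two score lists."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Illustrative data only: the judge preserves the human ordering
# but compresses and shifts the scores upward.
human = [2.0, 4.0, 5.0, 7.0, 9.0]
judge = [6.0, 6.5, 7.0, 8.5, 9.5]

rank_agreement = spearman(human, judge)
score_error = mean_abs_error(human, judge)
print(f"Spearman rank correlation: {rank_agreement:.2f}")
print(f"Mean absolute score error: {score_error:.2f}")
```

In this toy case the rank correlation is perfect while the average score is off by about two points on the scale, which is the kind of pattern that a rank-only metric would hide.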

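Impact: Introduces a method for quantifying the reliability of VLM evaluation, which is essential for benchmarking and for understanding model limitations.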
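As a rough sketch of how conformal prediction can turn a judge's point score into an interval, the example below uses split conformal prediction with absolute residuals on a calibration set of human-labelled scores. Everything specific here is an assumption for illustration: the 1 to 10 score scale, the synthetic data, the two noise levels standing in for an "easy" and a "hard" task, and the function names. The paper's actual construction may differ.

```python
import math
import random


def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    # Index of the order statistic used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[k - 1]


def calibrate(judge_scores, human_scores, alpha=0.1):
    """Compute the interval half-width from judge/human score pairs."""
    residuals = [abs(h - j) for j, h in zip(judge_scores, human_scores)]
    return conformal_quantile(residuals, alpha)


def predict_interval(judge_score, q, lo=1.0, hi=10.0):
    """Interval around a new judge score, clipped to the assumed score scale."""
    return max(lo, judge_score - q), min(hi, judge_score + q)


# Synthetic data only: judge = human score plus Gaussian noise, clipped to 1..10.
rng = random.Random(0)


def simulate(noise, n=500):
    human = [rng.uniform(1, 10) for _ in range(n)]
    judge = [min(10.0, max(1.0, h + rng.gauss(0, noise))) for h in human]
    return judge, human


q_aes = calibrate(*simulate(noise=0.5))   # hypothetical low-noise task
q_math = calibrate(*simulate(noise=2.0))  # hypothetical high-noise task

print("low-noise task interval for score 6.0: ", predict_interval(6.0, q_aes))
print("high-noise task interval for score 6.0:", predict_interval(6.0, q_math))
```

With exchangeable calibration and test data, intervals built this way cover the human score with probability at least 1 - alpha. A noisier judge produces a larger half-width, so the interval width itself signals how much a score on that task can be trusted.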

Ranking rationale: An academic paper presenting a new method for evaluating multimodal AI systems.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 3 sources.


Sources [3]

  1. arXiv stat.ML TIER_1 English(EN) · Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi

    VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

    arXiv:2604.25235v1 Announce Type: cross Abstract: Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framewo…

  2. arXiv cs.CV TIER_1 English(EN) · Amit Ranjan Trivedi

    VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

    Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a cali…

  3. arXiv stat.ML TIER_1 English(EN) · Amit Ranjan Trivedi

    VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

    Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a cali…