VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

VLM judges show task-dependent uncertainty in multimodal evaluation, which undermines the reliability of their scores.

By the PulseAugur editorial team · [3 sources] · 2026-04-28 05:30

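A new paper applies conformal prediction to assess how reliable vision-language models (VLMs) are when used as automated judges of multimodal systems. The study finds that uncertainty in VLM evaluation depends strongly on the task: mathematical reasoning tasks yield markedly wider, less informative prediction intervals than image aesthetics. The work also identifies a key problem it calls "rank-score decoupling", in which a VLM can rank responses accurately yet cannot provide reliable absolute scores, underscoring the need for more robust evaluation methods.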
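The summary does not give the paper's exact nonconformity score or calibration procedure. As a rough illustration of how conformal prediction turns a judge's point score into an interval, here is a minimal sketch of the standard split conformal recipe. It assumes absolute-residual scores against human ratings and a 1 to 10 rating scale. The function names, the scale, and the sample numbers are all hypothetical and are not taken from the paper.

```python
import math


def conformal_quantile(residuals, alpha):
    """Return the finite-sample-corrected (1 - alpha) quantile of the residuals."""
    n = len(residuals)
    # Split conformal prediction uses the ceil((n + 1) * (1 - alpha))-th
    # smallest calibration residual as the interval half-width.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to guarantee coverage at this alpha.
        return math.inf
    return sorted(residuals)[k - 1]


def calibrate(judge_scores, human_scores, alpha=0.1):
    """Compute the interval half-width from a held-out calibration set."""
    residuals = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return conformal_quantile(residuals, alpha)


def predict_interval(judge_score, q, lo=1.0, hi=10.0):
    """Wrap a new judge score in an interval, clipped to the rating scale."""
    return max(lo, judge_score - q), min(hi, judge_score + q)


# Hypothetical calibration data for two task types (illustrative only).
aesthetics_q = calibrate([7, 6, 8, 5, 9], [7, 6.5, 8, 5, 8.5], alpha=0.2)
math_q = calibrate([7, 3, 9, 5, 8], [4, 6, 9, 2, 3], alpha=0.2)

print("aesthetics interval:", predict_interval(7.0, aesthetics_q))
print("math interval:", predict_interval(7.0, math_q))
```

Under these assumptions, a task where the judge's residuals against human ratings are large gets a wider interval for the same point score, which is the sense in which an interval can be "less informative". Calibrating separately per task, as done here, is one design choice; pooling all tasks into one calibration set would hide the task dependence.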

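Impact: Introduces a way to quantify the reliability of VLM evaluation, which is essential for benchmarking and for understanding model limitations.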

Ranking rationale: Academic paper presenting a new method for evaluating multimodal AI systems.

AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

arXiv stat.ML TIER_1 English(EN) · Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi · 2026-04-29 04:00

VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

arXiv:2604.25235v1 · Announce Type: cross · Abstract: Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framewo…

arXiv cs.CV TIER_1 English(EN) · Amit Ranjan Trivedi · 2026-04-28 05:30

VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a cali…
arXiv stat.ML TIER_1 English(EN) · Amit Ranjan Trivedi · 2026-04-28 05:30

VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a cali…
