RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

New RECOM dataset reveals a metric tradeoff in LLM evaluation

By PulseAugur Editorial · [2 sources] · 2026-06-17 15:55

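Researchers have introduced RECOM, a new evaluation dataset designed to assess automatic metrics for open-ended question answering, particularly for LLM-generated text. The dataset contains 15,000 r/AskReddit questions with their real community responses, and it highlights a tension between a metric's ability to recognize genuine content alignment (validity) and its ability to rank different models (discriminative power). Experiments show that metrics such as cosine similarity perform well on validity but poorly on discriminative power, whereas metrics such as BERTScore precision show promise for ranking but have weaker validity. The study finds that this tradeoff is an inherent property of the metrics themselves, stemming from their representational design, and recommends reporting metrics along both axes together with a random baseline.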
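To make the two axes concrete, here is a minimal sketch of how validity and discriminative power could be measured separately for one metric. It is an illustration under stated assumptions, not the paper's protocol: the function names (`cosine_bow`, `validity_gap`, `discrimination_gap`), the bag-of-words cosine used as a stand-in metric, and the re-pairing scheme used as the random baseline are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words counts (a stand-in metric, assumed here)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def validity_gap(candidates, references, metric, seed=0):
    """Mean score on matched pairs minus mean score on randomly re-paired ones.

    The re-pairing rotates the references by a random non-zero offset, so no
    candidate is scored against its own reference (needs at least 2 items).
    """
    rng = random.Random(seed)
    offset = rng.randrange(1, len(references))
    mismatched = references[offset:] + references[:offset]
    matched_score = mean(metric(c, r) for c, r in zip(candidates, references))
    baseline_score = mean(metric(c, r) for c, r in zip(candidates, mismatched))
    return matched_score - baseline_score


def discrimination_gap(system_a, system_b, references, metric):
    """Difference in mean metric score between two systems on the same references."""
    score_a = mean(metric(c, r) for c, r in zip(system_a, references))
    score_b = mean(metric(c, r) for c, r in zip(system_b, references))
    return score_a - score_b


if __name__ == "__main__":
    refs = ["pizza with extra cheese", "a long walk on the beach",
            "reading science fiction novels"]
    strong = ["extra cheese pizza", "walk on the beach", "science fiction novels"]
    weak = ["i do not know", "no idea", "nothing"]
    print("validity gap:", validity_gap(strong, refs, cosine_bow))
    print("discrimination gap:", discrimination_gap(strong, weak, refs, cosine_bow))
```

Reporting both numbers side by side, next to the re-paired baseline, is one way to follow the recommendation above: a metric can separate matched from mismatched pairs well and still fail to separate two systems, or the reverse.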

Impact: Highlights the limitations of current LLM evaluation metrics and may guide the development of more robust evaluation tools.

Ranking rationale: This cluster describes a new research paper that introduces a new dataset and evaluation methodology for LLMs.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee · 2026-06-18 04:00

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

arXiv:2606.19218v1 Announce Type: new Abstract: Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discri…
arXiv cs.CL TIER_1 English(EN) · Tathagata Mukherjee · 2026-06-17 15:55

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven q…
