PulseAugur

New RECOM dataset reveals metric tradeoff in LLM evaluation

Researchers have introduced RECOM, an evaluation dataset for assessing automatic metrics on open-ended question answering, particularly for LLM-generated text. The dataset comprises 15,000 r/AskReddit questions with their authentic community replies. It highlights a tension between two jobs a metric is asked to do: distinguishing genuine content alignment from surface coincidence (validity), and distinguishing a better system from a worse one (discriminative power). In the reported experiments, metrics such as cosine similarity score well on validity but discriminate poorly between models, whereas metrics such as BERTScore precision rank models more effectively but have weaker validity. The study attributes this tradeoff to the metrics themselves, specifically their representation design, and recommends reporting every metric along both axes alongside a random baseline.
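To make the "random baseline" recommendation concrete, here is a minimal sketch in Python. It is not the paper's code: the bag-of-words cosine is a simple stand-in for whatever representation a real metric uses, the function names (`cosine_similarity`, `mean_score`, `random_baseline`) are hypothetical, and the three question replies are invented toy data. The idea is only to show a metric's score on correctly matched pairs next to its score when candidates are paired with randomly shuffled references.

```python
import math
import random
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (illustrative stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_score(candidates, references):
    """Average metric score over aligned (candidate, reference) pairs."""
    pairs = list(zip(candidates, references))
    return sum(cosine_similarity(c, r) for c, r in pairs) / len(pairs)


def random_baseline(candidates, references, trials=200, seed=0):
    """Average score when references are shuffled, so pairings are arbitrary."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        shuffled = list(references)
        rng.shuffle(shuffled)
        total += mean_score(candidates, shuffled)
    return total / trials


# Invented toy data: three community-style replies and two hypothetical systems.
references = [
    "the best pizza topping is pepperoni",
    "i would travel to japan for the food",
    "my cat wakes me up every morning",
]
system_a = [
    "pepperoni is the best pizza topping",
    "japan for the food",
    "my cat wakes me every morning",
]
system_b = ["pizza is nice", "i like to travel", "cats are pets"]

matched_a = mean_score(system_a, references)
matched_b = mean_score(system_b, references)
baseline_a = random_baseline(system_a, references)

# Gap over the random baseline: a rough proxy for whether the metric rewards
# genuine alignment rather than incidental word overlap.
validity_gap_a = matched_a - baseline_a
# Gap between systems: a rough proxy for how well the metric separates them.
system_gap = matched_a - matched_b

print(f"system A matched={matched_a:.3f} baseline={baseline_a:.3f}")
print(f"system A vs B gap={system_gap:.3f}")
```

Reporting the matched score together with its shuffled-reference baseline, and separately the gap between systems, keeps the two axes apart instead of folding them into a single number.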

IMPACT Highlights limitations in current LLM evaluation metrics, potentially guiding the development of more robust assessment tools.

RANK_REASON The cluster describes a new research paper introducing a novel dataset and evaluation methodology for LLMs.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

    RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

    arXiv:2606.19218v1. Abstract: Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discri…

  2. arXiv cs.CL TIER_1 English(EN) · Tathagata Mukherjee

    RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

    Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven q…