RECOM: A Validity-Discrimination Tradeoff in Automatic Metrics for Open-Ended Reddit Question Answering
Researchers have introduced RECOM, an evaluation dataset for assessing automatic metrics for open-ended question answering, particularly on LLM-generated text. The dataset comprises 15,000 r/AskReddit questions with their authentic community replies, and it exposes a tension between a metric's validity (its ability to identify genuine content alignment) and its discriminative power (its capacity to rank different models). In the experiments, metrics such as cosine similarity excel at validity but struggle with discrimination, while metrics such as BERTScore precision show promise for ranking but have weaker validity. The study suggests that the tradeoff is inherent to the metrics themselves, stemming from their representation design, and recommends reporting metrics along both axes together with a random baseline.
IMPACT: Highlights limitations in current LLM evaluation metrics, which may guide the development of more robust assessment tools.
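To make the two axes concrete, here is a minimal Python sketch of reporting one metric along both of them. It is not the study's method: the bag-of-words cosine metric, the function names (`bow_cosine`, `validity_gap`, `discrimination_gap`), the toy replies, and both gap definitions are assumptions chosen for illustration. In particular, the baseline here re-pairs each candidate with the next question's reply by rotation, a deterministic stand-in for random re-pairing.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (toy stand-in metric)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_score(metric, candidates, references):
    """Average metric score over aligned (candidate, reference) pairs."""
    return sum(metric(c, r) for c, r in zip(candidates, references)) / len(candidates)


def validity_gap(metric, candidates, references):
    """Assumed validity proxy: matched mean minus mismatched-baseline mean.

    The baseline pairs each candidate with the next question's reference
    (a rotation), standing in for random re-pairing so the result is deterministic.
    """
    mismatched = references[1:] + references[:1]
    return mean_score(metric, candidates, references) - mean_score(metric, candidates, mismatched)


def discrimination_gap(metric, system_a, system_b, references):
    """Assumed discrimination proxy: absolute difference between two systems' mean scores."""
    return abs(mean_score(metric, system_a, references) - mean_score(metric, system_b, references))


# Hypothetical toy data; none of these strings come from the RECOM dataset.
references = [
    "i would travel the world with my family",
    "pizza with extra cheese every single time",
    "learning to play the guitar changed my life",
]
system_a = ["i would travel the world", "pizza with cheese", "learning guitar changed my life"]
system_b = ["travel is nice", "i like pizza", "guitar is fun"]

print("validity gap (system A):", validity_gap(bow_cosine, system_a, references))
print("discrimination gap (A vs B):", discrimination_gap(bow_cosine, system_a, system_b, references))
```

Reporting both numbers next to the mismatched baseline shows whether a high score reflects real alignment with the reference reply or only surface overlap that any reply would produce.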