Researchers have introduced RECOM, a new evaluation dataset for assessing automatic metrics for open-ended question answering, particularly on LLM-generated text. The dataset, comprising 15,000 r/AskReddit questions and their authentic community replies, highlights a tension between a metric's ability to identify genuine content alignment (validity) and its capacity to rank different models (discriminative power). Experiments show that metrics like cosine similarity excel at validity but struggle with discrimination, while metrics like BERTScore precision show promise in ranking but have weaker validity. The study suggests that this tradeoff is inherent to the metrics themselves, stemming from their representation design, and recommends reporting metrics along both axes alongside a random baseline.
AI IMPACT Highlights limitations in current LLM evaluation metrics, potentially guiding the development of more robust assessment tools.
RANK_REASON The cluster describes a new research paper introducing a novel dataset and evaluation methodology for LLMs.
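To make the validity axis concrete, here is a minimal sketch of how a similarity metric can be compared against a mismatched-pair baseline. It is not the paper's implementation: the bag-of-words cosine similarity, the function names `cosine_similarity` and `matched_vs_baseline`, and the choice to build the baseline by pairing each answer with the reply to a different question are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over whitespace-token counts (an illustrative stand-in
    for the embedding-based similarity a real evaluation would use)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def matched_vs_baseline(answers: list[str], replies: list[str]) -> tuple[float, float]:
    """Return (mean score on matched pairs, mean score on mismatched pairs).

    Assumption: the baseline pairs each answer with the reply to the *next*
    question (a rotation), so no answer is scored against its own reply.
    """
    if len(answers) != len(replies) or len(answers) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rotated = replies[1:] + replies[:1]
    matched = sum(cosine_similarity(a, r) for a, r in zip(answers, replies))
    baseline = sum(cosine_similarity(a, r) for a, r in zip(answers, rotated))
    n = len(answers)
    return matched / n, baseline / n
```

Under this sketch, a metric with good validity scores matched pairs well above the mismatched baseline. Discriminative power is a separate question: it asks whether the same metric can reliably separate the outputs of different models.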
- arXiv
- BERTScore: Evaluating text generation with BERT
- Hugging Face
- LLM
- Pushwitha Krishnappa
- /r/AskReddit
- cosine similarity
AI-generated summary · Google Gemini · from 2 sources.