Researchers evaluated six common reference-free factuality metrics for long-document summarization and found that they perform inconsistently. The metrics struggled with input length limits and with the long-range dependencies inherent in longer texts. Using perturbations and analyses across several domains, the study showed that existing metrics produce unreliable scores for semantically equivalent summaries and are particularly sensitive to information-dense claims.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Highlights limitations in current factuality metrics for long-form summarization, suggesting areas for improvement in evaluation.
RANK_REASON Academic paper evaluating existing metrics for a specific NLP task.