Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages
Researchers have developed a new benchmark called ITEM to evaluate the reliability of automatic metrics for machine translation and summarization in Indian languages. The study found that LLM-based evaluators aligned best with human judgments, and that outliers significantly affected metric agreement. The research also highlighted differences in how metrics capture fluency versus content fidelity across translation and summarization tasks, and noted that metrics vary in their robustness to perturbations.
AI IMPACT: Provides critical guidance for improving evaluation metrics in machine translation and summarization for under-resourced languages.