When the Gold Standard Isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content
Researchers have identified significant challenges in evaluating the translation of user-generated content (UGC), which is inherently non-standard in its language. They developed a taxonomy of twelve non-standard phenomena and five translation actions to analyze how different datasets handle UGC, and found that reference translations fall along a spectrum of standardness. The study also found that large language models' translation scores are sensitive to the specific instructions they are given and improve when those instructions are aligned with a dataset's guidelines. On that basis, the authors advocate guideline-aware evaluation frameworks.
AI IMPACT: Highlights the need for more nuanced evaluation metrics for LLMs handling diverse language inputs.