Researchers have introduced SwissGov-RSD, a cross-lingual benchmark dataset for evaluating how well models recognize semantic differences between related documents. The dataset comprises 224 multi-parallel documents in English, German, French, and Italian, annotated by humans with token-level difference information. On this benchmark, current large language models and encoder models perform significantly worse than they do on monolingual or synthetic benchmarks, which points to a gap in their ability to discern semantic variations across languages.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT This new benchmark exposes limitations in how well current LLMs and encoder models detect semantic differences between related documents, particularly across languages.
RANK_REASON This is a research paper introducing a new benchmark dataset for evaluating LLMs.