CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios
Researchers have introduced CHALIS, a new dataset designed to test language identification systems in challenging scenarios. The dataset includes examples of closely related languages and text with orthographic noise, such as transliteration and internet slang. Evaluations showed that current language identification systems struggle significantly with these difficult cases, particularly for lower-resource languages and noisy inputs.
AI IMPACT: Highlights limitations in current language identification models, potentially driving research into more robust solutions for diverse linguistic inputs.