Brief · PulseAugur

RESEARCH · arXiv cs.CL · English (EN) · 6d · [2 sources]

CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios

Researchers have introduced CHALIS, a new dataset designed to test language identification systems in challenging scenarios. The dataset includes examples of closely related languages and text with orthographic noise, such as transliteration and internet slang. Evaluations showed that current language identification systems struggle significantly with these difficult cases, particularly for lower-resource languages and noisy inputs.

IMPACT: Highlights limitations in current language identification models, which may drive research into systems that are more robust to closely related languages and noisy text.

Spanish
Portuguese
Czech
Catalan
Danish
Norwegian
Slovak
Galician