SPLIT is a new benchmark that evaluates the cross-lingual empathy and cultural grounding of Large Language Models (LLMs) in crisis-related situations, focusing on English and Ukrainian. It includes 500 prompts across five categories: Stress, Panic, Loneliness, Internal Displacement, and Tension. Evaluations of Gemini 2.5-Flash and Llama 3.3 70B Instruct showed degraded performance on Ukrainian, while DeepSeek-V3 remained stable. The study also found that human and AI evaluators agree only weakly on empathy and naturalness and diverge on cultural grounding, suggesting that generating Ukrainian text does not equate to providing culturally appropriate emotional support.
AI IMPACT This benchmark could drive the development of more culturally sensitive and empathetic LLMs for crisis support in low-resource languages.
RANK_REASON The cluster contains an academic paper introducing a new benchmark for LLM evaluation.
AI-generated summary · Google Gemini · from 1 source.