PulseAugur

New EMPATH benchmark evaluates emotional-support chatbot safety across languages

Researchers have introduced EMPATH, a benchmark for evaluating the safety of emotional-support chatbots across multiple languages and conversational turns. Unlike static benchmarks, EMPATH simulates multi-turn crisis conversations using an auditor model and scores the transcripts against 19 metrics across five dimensions. Initial studies in Mexican Spanish revealed significant score inflation on many metrics and considerable variability in model performance even with identical inputs, suggesting that run-to-run reliability is itself a critical per-model safety property. The benchmark, its pipeline, and associated data are being released for broader use.

IMPACT This benchmark could lead to more robust safety evaluations for conversational AI, particularly in sensitive emotional-support applications.



AI-generated summary · Google Gemini · from 2 sources.


COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Camilo Chacón Sartori ·

    EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

    arXiv:2606.30256v1. Abstract: Safety benchmarks often buy scalability by fixing the prompt, the language, and the turn structure. For emotional-support chatbots, that bargain hides precisely where safety failures emerge: across a multilingual, multi-turn crisis …

  2. arXiv cs.AI TIER_1 English(EN) · Camilo Chacón Sartori ·

    EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

    Safety benchmarks often buy scalability by fixing the prompt, the language, and the turn structure. For emotional-support chatbots, that bargain hides precisely where safety failures emerge: across a multilingual, multi-turn crisis conversation. We present EMPATH, a benchmark for…