New EMPATH benchmark evaluates emotional-support chatbot safety across languages

By PulseAugur Editorial · [2 sources] · 2026-06-29 13:05

Researchers have introduced EMPATH, a benchmark for evaluating the safety of emotional-support chatbots across multiple languages and conversational turns. Unlike static benchmarks, EMPATH uses an auditor model to simulate complex, multi-turn crisis conversations and scores the resulting transcripts against 19 metrics across five dimensions. Initial studies in Mexican Spanish revealed significant score inflation on many metrics. They also showed considerable variability in model performance even with identical inputs, which suggests that run-to-run reliability is itself a critical per-model safety property. The benchmark, its pipeline, and the associated data are being released for broader use.
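To make the run-to-run reliability point concrete, here is a minimal sketch of how the spread of per-metric scores across repeated runs with identical inputs could be summarized. It is not EMPATH's pipeline: the function name, the metric names, and the scores are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def run_to_run_spread(scores_by_run):
    """Summarize how much each metric varies across repeated runs.

    scores_by_run: list of dicts (one per run) mapping metric name -> score.
    All runs are assumed to share the same metric names (an assumption of
    this sketch, not something the source specifies).
    """
    report = {}
    for metric in scores_by_run[0]:
        values = [run[metric] for run in scores_by_run]
        report[metric] = {
            "mean": mean(values),
            "stdev": pstdev(values),  # population standard deviation
            "range": max(values) - min(values),
        }
    return report


# Hypothetical scores for one model, same scenario, three repeated runs.
runs = [
    {"crisis_recognition": 4.0, "referral_quality": 3.0},
    {"crisis_recognition": 4.0, "referral_quality": 1.0},
    {"crisis_recognition": 4.0, "referral_quality": 5.0},
]

report = run_to_run_spread(runs)
for metric, stats in report.items():
    print(metric, stats)
```

In this made-up example both metrics have a similar mean, but only one of them is stable across runs. A single-run score would hide that difference, which is why the spread is reported alongside the average.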

IMPACT This benchmark could lead to more robust safety evaluations for conversational AI, particularly in sensitive emotional-support applications.

paper
safety

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.AI TIER_1 English(EN) · Camilo Chacón Sartori · 2026-06-30 04:00

EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

arXiv:2606.30256v1 · Announce Type: new · Abstract: Safety benchmarks often buy scalability by fixing the prompt, the language, and the turn structure. For emotional-support chatbots, that bargain hides precisely where safety failures emerge: across a multilingual, multi-turn crisis …

arXiv cs.AI TIER_1 English(EN) · Camilo Chacón Sartori · 2026-06-29 13:05

EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

Safety benchmarks often buy scalability by fixing the prompt, the language, and the turn structure. For emotional-support chatbots, that bargain hides precisely where safety failures emerge: across a multilingual, multi-turn crisis conversation. We present EMPATH, a benchmark for…
