LLMs show language bias in mental health evaluations

By PulseAugur Editorial · [1 source] · 2026-06-11 04:00

A new study published on arXiv reports that multilingual large language models produce different mental health evaluations depending on the language of the prompt. With models such as GPT-4o and Qwen3-32B, the researchers found that prompts in Chinese elicited higher stigma scores and more conservative depression severity judgments than equivalent prompts in English. This suggests that LLMs do not apply consistent evaluative standards across languages in sensitive domains, which could lead to underestimation errors in mental health assessments.

IMPACT Highlights the need for careful evaluation of multilingual LLMs in sensitive applications like mental health to ensure consistent and unbiased performance across languages.

RANK_REASON Academic paper detailing research findings on LLM behavior.


paper
safety

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Jiayi Xu, Xiyang Hu · 2026-06-11 04:00

Language Shapes Mental Health Evaluations in Large Language Models

arXiv:2603.06910v2. Abstract: Multilingual large language models (LLMs) are increasingly used in socially sensitive mental health contexts, including support chatbots, screening, and content moderation. This raises a reliability question: do semantically equ…
