PulseAugur

LLMs show language bias in mental health evaluations

A new study posted on arXiv reports that multilingual large language models give different mental health evaluations depending on the language of the prompt. With models such as GPT-4o and Qwen3-32B, the researchers found that Chinese prompts elicited higher stigma scores and more conservative depression-severity judgments than equivalent English prompts. The results suggest that LLMs do not apply consistent evaluative standards across languages in sensitive domains, which could lead to underestimation errors in mental health assessments.
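As a rough illustration of the kind of cross-language comparison the summary describes, the sketch below computes the average per-item gap between scores a model assigns to the same items prompted in two languages. The function name `paired_language_gap` and the score lists are hypothetical; the numbers are invented placeholders, not results from the paper, and the paper's actual scoring method is not shown here.

```python
from statistics import mean


def paired_language_gap(scores_a, scores_b):
    """Mean per-item difference (a - b) for the same items scored in two languages."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired item by item")
    if not scores_a:
        raise ValueError("need at least one paired item")
    return mean(a - b for a, b in zip(scores_a, scores_b))


# Invented placeholder scores for four equivalent prompts (NOT data from the study).
zh_scores = [3, 4, 2, 5]  # hypothetical scores from Chinese prompts
en_scores = [2, 4, 1, 4]  # hypothetical scores from the English equivalents

gap = paired_language_gap(zh_scores, en_scores)
print(gap)  # a positive gap means higher scores under the first language
```

A gap near zero would be consistent with the model applying the same standard in both languages; a persistent non-zero gap on semantically equivalent prompts is the kind of inconsistency the study reports.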

IMPACT Highlights the need for careful evaluation of multilingual LLMs in sensitive applications like mental health to ensure consistent and unbiased performance across languages.

RANK_REASON Academic paper detailing research findings on LLM behavior.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CL TIER_1 English (EN) · Jiayi Xu, Xiyang Hu

    Language Shapes Mental Health Evaluations in Large Language Models

    arXiv:2603.06910v2. Abstract: Multilingual large language models (LLMs) are increasingly used in socially sensitive mental health contexts, including support chatbots, screening, and content moderation. This raises a reliability question: do semantically equ…