A new study published on arXiv reports that multilingual large language models produce different mental health evaluations depending on the language of the prompt. The researchers found that, for models such as GPT-4o and Qwen3-32B, prompts in Chinese elicited higher stigma scores and more conservative depression severity judgments than equivalent prompts in English. This suggests that LLMs do not apply consistent evaluative standards across languages in sensitive domains, which could lead to under-estimation errors in mental health assessments.
AI IMPACT Highlights the need to evaluate multilingual LLMs carefully in sensitive applications such as mental health, so that performance is consistent and unbiased across languages.
RANK_REASON Academic paper detailing research findings on LLM behavior.
AI-generated summary · Google Gemini · from 1 source.