Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 4d · [2 sources]

When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

A new study posted to arXiv evaluated five large language models on psychiatric screening using a benchmark of 555 interviews. Accuracy varied across models, with GPT-4.1 Mini and GPT-5 Mini showing the most consistent results. The researchers found that the LLMs tended to discount symptom evidence when patients reported preserved functioning or social support, which underscores the need for careful validation before clinical use.

IMPACT: LLMs show potential for scalable psychiatric screening but require careful validation due to biases in evidence interpretation.
