When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening
A new study published on arXiv evaluated the performance of five large language models in psychiatric screening using a benchmark of 555 interviews. The models demonstrated varying accuracy, with GPT-4.1 Mini and GPT-5 Mini showing the most consistent results. Researchers found that LLMs tended to discount symptom evidence when patients reported preserved functioning or social support, highlighting a need for careful validation before clinical use.
IMPACT: LLMs show potential for scalable psychiatric screening but require careful validation due to biases in evidence interpretation.