PulseAugur

LLMs show mixed results in psychiatric screening benchmark

Researchers have developed a new benchmark to evaluate how well large language models can screen for psychiatric conditions using patient interviews. The study found that while models such as GPT-4.1 Mini and GPT-5 Mini showed some accuracy, their performance varied across disorders and demographic groups. Notably, the models tended to discount symptom evidence when patients reported preserved functioning or social support, which suggests that careful validation is needed before clinical use.

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT LLMs show potential for scalable psychiatric screening but require careful validation due to biases in interpreting symptom evidence.

RANK_REASON Academic paper introducing a new benchmark and evaluating LLMs for psychiatric screening.


COVERAGE [1]

  1. arXiv cs.CL TIER_1 · Jianfeng Zhu, Megan Korhummel, Ruoming Jin, Karin G. Coifman

    When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

    arXiv:2605.23148v1. Abstract: As demand for mental health care outpaces clinician-delivered assessment, scalable screening tools are increasingly needed. Large language models (LLMs) may identify psychiatric risk from patient narratives, but their reliability ac…