Researchers have developed a new benchmark to evaluate how well large language models can screen for psychiatric conditions using patient interviews. The study found that while models like GPT-4.1 Mini and GPT-5 Mini showed some accuracy, their performance varied across different disorders and demographic groups. Notably, the models tended to discount symptom evidence if patients reported preserved functioning or social support, suggesting a need for careful validation before clinical use.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT LLMs show potential for scalable psychiatric screening but require careful validation due to biases in interpreting symptom evidence.
RANK_REASON Academic paper introducing a new benchmark and evaluation of LLMs for psychiatric screening.