A new study evaluated the reliability of large language models (LLMs) in predicting psychiatric hospitalization risk. Researchers found that including medically insignificant details in patient profiles significantly increased the predicted risk scores and output variability across four audited LLMs: Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, and GPT-4o mini. The study highlights that LLM-based psychiatric assessments are sensitive to non-clinical information, underscoring the need for systematic evaluations before clinical deployment.
Impact: Reveals potential unreliability in LLM clinical risk assessments, urging caution before deployment in sensitive areas like psychiatry.
Ranking rationale: Academic paper detailing a new evaluation methodology for LLM reliability in a specific domain.
AI-generated summary · Google Gemini · from 2 sources.