PulseAugur

LLMs show instability in psychiatric risk scores with irrelevant data

A new study audited how reliably large language models (LLMs) predict psychiatric hospitalization risk. The researchers found that adding medically insignificant details to patient profiles significantly increased both the predicted risk scores and the variability of the output across the four audited LLMs: Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, and GPT-4o mini. Because LLM-based psychiatric assessments are sensitive to non-clinical information, the study calls for systematic evaluation before clinical deployment.
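The kind of check described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method: the summary does not state the score scale or the statistics used, so the 0-100 scale, the helper name `audit_shift`, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean, stdev


def audit_shift(baseline_scores, perturbed_scores):
    """Compare repeated risk scores for one patient profile.

    baseline_scores: scores returned for the original profile.
    perturbed_scores: scores returned after adding clinically
    irrelevant details to the same profile.
    Hypothetical helper; the scoring scale is an assumption.
    """
    if len(baseline_scores) < 2 or len(perturbed_scores) < 2:
        raise ValueError("need at least two scores per condition")
    return {
        # Positive value: irrelevant details pushed the risk score up.
        "mean_shift": mean(perturbed_scores) - mean(baseline_scores),
        # Sample standard deviations measure run-to-run variability.
        "baseline_sd": stdev(baseline_scores),
        "perturbed_sd": stdev(perturbed_scores),
    }


# Made-up numbers for illustration only, not results from the study.
baseline = [40, 42, 41, 40, 42]
perturbed = [48, 55, 44, 60, 51]
report = audit_shift(baseline, perturbed)
print(report)
```

A positive `mean_shift` together with a `perturbed_sd` larger than `baseline_sd` would correspond to the pattern the summary reports: higher predicted risk and less stable output once non-clinical details are added.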

IMPACT Reveals potential unreliability in LLM clinical risk assessments, urging caution before deployment in sensitive areas like psychiatry.

RANK_REASON Academic paper detailing a new evaluation methodology for LLM reliability in a specific domain.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.LG TIER_1 English(EN) · Shevya Pandya, Shinjini Bose, Ananya Joshi

    Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

    arXiv:2604.22063v1. Abstract: Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has ident…

  2. arXiv cs.AI TIER_1 English(EN) · Ananya Joshi

    Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

    Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity …