PulseAugur

LLM psychiatric diagnosis benchmark reveals accuracy gaps in complex cases

A new benchmark, LingxiDiagBench, evaluates Large Language Models (LLMs) in Chinese psychiatric consultation and diagnosis. It includes LingxiDiag-16K, a dataset of 16,000 synthetic dialogues designed to mimic real clinical distributions across 12 ICD-10 categories. Experiments show that LLMs perform well on binary classification tasks such as distinguishing depression from anxiety, but their accuracy drops significantly on more complex tasks such as comorbidity recognition and 12-way differential diagnosis. The study also found that dynamic multi-turn consultation can underperform static evaluation, which suggests that an LLM's information-gathering strategy affects its diagnostic reasoning.

IMPACT Highlights limitations in LLM diagnostic reasoning for complex mental health conditions, indicating areas for future research and development.

RANK_REASON The cluster describes a new academic paper introducing a benchmark dataset and evaluation framework for LLMs.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. Hugging Face Daily Papers · TIER_1 · English (EN)

    LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

    A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy.