LLM psychiatric diagnosis benchmark reveals accuracy gaps in complex cases

By PulseAugur Editorial · [1 source] · 2026-06-11 00:00

A new benchmark, LingxiDiagBench, evaluates large language models (LLMs) on Chinese psychiatric consultation and diagnosis. It includes LingxiDiag-16K, a dataset of 16,000 synthetic dialogues designed to mirror real clinical distributions across 12 ICD-10 categories. In the reported experiments, LLMs perform well on binary classification, such as distinguishing depression from anxiety, but their accuracy drops substantially on harder tasks such as comorbidity recognition and 12-way differential diagnosis. The study also found that dynamic multi-turn consultation can yield worse diagnostic results than static evaluation, which suggests that how a model gathers information affects its diagnostic reasoning.

IMPACT Highlights limitations in LLM diagnostic reasoning for complex mental health conditions, indicating areas for future research and development.

RANK_REASON The cluster describes a new academic paper introducing a benchmark dataset and evaluation framework for LLMs.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy.
