Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 1d

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Researchers have developed a framework for evaluating the semantic stability of clinical Large Language Models (LLMs). The framework uses Natural Language Inference (NLI) to keep only those prompt variations that preserve clinical meaning, targeting the risk that an LLM gives inconsistent diagnoses when the wording changes slightly. In an evaluation of 16 LLMs, domain specialization did not consistently improve robustness, and some general-purpose models remained competitive.
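The idea of NLI-based filtering followed by a consistency check can be illustrated with a minimal sketch. This is not the paper's implementation: the bidirectional-entailment rule, the 0.9 threshold, the `entail` scoring interface, the function names, and the toy prompts and scores are all assumptions made for illustration. In practice `entail` would wrap a real NLI model; here it is a hard-coded lookup so the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: probability that `premise` entails `hypothesis`.
EntailFn = Callable[[str, str], float]


def filter_paraphrases(
    original: str,
    candidates: List[str],
    entail: EntailFn,
    threshold: float = 0.9,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Keep candidates that both entail and are entailed by the original."""
    kept = []
    for cand in candidates:
        forward = entail(original, cand)   # original => candidate
        backward = entail(cand, original)  # candidate => original
        if min(forward, backward) >= threshold:
            kept.append(cand)
    return kept


def stability(original_answer: str, variant_answers: List[str]) -> float:
    """Fraction of variant answers that match the original answer."""
    if not variant_answers:
        raise ValueError("need at least one variant answer")
    matches = sum(1 for a in variant_answers if a == original_answer)
    return matches / len(variant_answers)


# Toy data, invented for this example only.
ORIGINAL = "A 54-year-old man presents with crushing chest pain radiating to the left arm."
SAME_MEANING = "A man aged 54 reports crushing chest pain that spreads to his left arm."
CHANGED_MEANING = "A 54-year-old man presents with mild chest discomfort."

# Stand-in for an NLI model: made-up entailment scores per (premise, hypothesis).
TOY_SCORES: Dict[Tuple[str, str], float] = {
    (ORIGINAL, SAME_MEANING): 0.97,
    (SAME_MEANING, ORIGINAL): 0.95,
    (ORIGINAL, CHANGED_MEANING): 0.40,
    (CHANGED_MEANING, ORIGINAL): 0.10,
}


def toy_entail(premise: str, hypothesis: str) -> float:
    """Look up a made-up score; unknown pairs count as no entailment."""
    return TOY_SCORES.get((premise, hypothesis), 0.0)


kept = filter_paraphrases(ORIGINAL, [SAME_MEANING, CHANGED_MEANING], toy_entail)
score = stability("A", ["A", "B", "A", "A"])
```

With the toy scores, only the meaning-preserving rewording survives the filter, and the stability score is the share of variant answers that agree with the answer to the original prompt. Requiring entailment in both directions is one plausible way to reject paraphrases that drop or add clinical detail; a one-directional check would be more permissive.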

IMPACT Highlights critical safety concerns for LLMs in healthcare, emphasizing the need for robust evaluation beyond simple semantic similarity.

LLMs
MedQA
DiagnosisQA
Natural Language Inference (NLI)
Mahdi Alkaeed Khalaf