Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text
A new benchmark has been developed to evaluate how well large language models (LLMs) preserve diagnostic uncertainty in clinical text. Researchers found that current LLMs often fail to maintain the original level of uncertainty, sometimes preserving it less than half the time. The study highlights a critical failure mode for LLMs in clinical settings, as altering uncertainty expressions can significantly change clinical meaning and impact treatment decisions. AI
IMPACT Highlights a critical failure mode for LLMs in clinical workflows, impacting safe deployment and treatment decisions.