Brief · PulseAugur

RESEARCH · arXiv cs.AI English(EN) · 18h · [2 sources]

Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

A new research paper on arXiv examines the limitations of large language models in clinical question answering over electronic health records (EHRs). The study found that models including Claude Sonnet, GPT-4o, and GPT-5.4-2026-03-05 show a significant decline in accuracy as the depth of compositional reasoning a clinical question requires increases. The authors present this decline as consistent with the compositional reasoning limits of transformer architectures rather than with EHR data truncation.

IMPACT Shows that clinical AI accuracy degrades as question complexity increases, highlighting a deployment risk and suggesting a need for careful risk stratification.

OpenAI
GPT-4o
Claude Sonnet
GPT-4
arXiv
GPT-5.4-2026-03-05
MedAlign