Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 3d

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

Researchers have introduced the Causal Sensitivity Score (CSS), a metric for evaluating clinical AI systems. CSS applies five types of clinical interventions to patient data and tests how well a model's output responds to each change. Six leading AI models ranked very differently under CSS than under traditional coverage-based metrics; one model ranked best on CSS despite ranking worst on the coverage-based metric. All tested models also shared a safety blind spot: none adjusted its recommendations when surgery status changed, a flaw that existing evaluation methods missed.
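The counterfactual idea described above can be illustrated with a minimal Python sketch. It is not the paper's CSS formula, which this brief does not give. The scoring rule (the fraction of cases where the output changes when it should), the `Intervention` type, the field names `lab_value` and `post_surgery`, the threshold, and the toy model are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types: a case is a dict of patient fields,
# and a model maps a case to a recommendation string.
Case = Dict[str, object]
Model = Callable[[Case], str]


@dataclass
class Intervention:
    name: str
    apply: Callable[[Case], Case]  # returns an edited copy of the case
    should_change: bool            # whether the recommendation ought to differ


def sensitivity_score(
    model: Model, cases: List[Case], interventions: List[Intervention]
) -> Dict[str, float]:
    """For each intervention, return the fraction of cases where the model's
    output changed (or stayed the same) as expected. Assumed scoring rule."""
    scores: Dict[str, float] = {}
    for iv in interventions:
        correct = 0
        for case in cases:
            before = model(case)
            after = model(iv.apply(dict(case)))  # copy so the original is untouched
            if (before != after) == iv.should_change:
                correct += 1
        scores[iv.name] = correct / len(cases)
    return scores


def toy_model(case: Case) -> str:
    # Toy stand-in for an LLM: reacts to a lab value (arbitrary threshold)
    # but deliberately ignores surgery status, mimicking the reported blind spot.
    return "reduce dose" if case["lab_value"] < 30 else "standard dose"


cases = [
    {"lab_value": 80, "post_surgery": False},
    {"lab_value": 60, "post_surgery": True},
]

interventions = [
    Intervention(
        "flip_surgery",
        lambda c: {**c, "post_surgery": not c["post_surgery"]},
        should_change=True,
    ),
    Intervention(
        "lower_lab",
        lambda c: {**c, "lab_value": 10},
        should_change=True,
    ),
]

scores = sensitivity_score(toy_model, cases, interventions)
print(scores)  # {'flip_surgery': 0.0, 'lower_lab': 1.0}
```

In this toy setup the model scores perfectly on the lab-value intervention but zero on the surgery-status intervention. A metric that only checks whether an answer covers expected content would not surface that kind of gap, because it never compares the model's outputs before and after an edit.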

IMPACT This new evaluation method could lead to more robust and safer clinical AI by exposing responsiveness deficits missed by current benchmarks.

AI
LLMs
agents
Causal Sensitivity Score
clinical AI systems
Consensus Match Score