Brief · PulseAugur

TOOL · arXiv cs.AI · 4h

PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

Researchers have introduced PSEBench, a benchmark for evaluating large language models (LLMs) on patient safety event triage. It is built with a policy-grounded construction method that uses "clause cards" to break regulatory text down into auditable decision specifications. The benchmark contains 5,074 cases based on Minnesota's reportable adverse health events and is designed to test evidence-grounded reasoning, information seeking, and principled abstention in ambiguous cases. Initial evaluations of 15 LLMs showed consistent capability trends and identified areas for improvement in applying LLMs to patient safety workflows.

IMPACT Provides a standardized way to assess LLM reliability in high-stakes clinical safety applications.

LLMs
Minnesota
PSEBench