Researchers have developed PSEBench, a new benchmark for evaluating Large Language Models (LLMs) on patient safety event triage. The benchmark is built with a policy-grounded construction methodology that uses "clause cards" to break regulatory text down into auditable decision specifications. PSEBench comprises 5,074 cases based on Minnesota's reportable adverse health events and aims to capture evidence-grounded reasoning, information seeking, and principled abstention in ambiguous situations. Initial evaluations of 15 LLMs revealed consistent capability trends and identified areas for improvement in applying LLMs to patient safety workflows.
AI IMPACT Provides a standardized method to assess LLM reliability in high-stakes clinical safety applications.
RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating LLMs.
AI-generated summary · Google Gemini · from 1 source.