Researchers have developed PaSBench-Video, a benchmark for evaluating how well video-capable multimodal large language models (MLLMs) issue proactive safety warnings. It comprises 740 videos spanning driving, healthcare, daily life, and industrial production, annotated with risk onset and accident boundaries. Tests of 13 MLLMs found that current models struggle with temporal calibration and false-positive rates, which suggests they rely on scene-level cues rather than genuine reasoning about harm.
AI IMPACT Highlights the limitations of current AI video analysis in safety applications and points to a need for models that reason about emerging harm rather than just scene activity.
AI-generated summary · Google Gemini · from 2 sources.