Researchers have introduced MedHorizon, a new benchmark designed to test multimodal large language models (MLLMs) on understanding long-form medical videos. This benchmark includes 759 hours of clinical procedures and 1,253 questions, focusing on the challenge of identifying sparse, crucial evidence within lengthy and often redundant visual data. Current models struggle significantly, with the best achieving only 41.1% accuracy, highlighting major bottlenecks in evidence retrieval and clinical reasoning over complete workflows.
Impact: Establishes a new, challenging benchmark for medical video understanding, pushing the development of MLLMs for complex clinical reasoning.
Ranking rationale: The cluster describes a new academic paper introducing a benchmark for AI model evaluation.
AI-generated summary · Google Gemini · from 2 sources.