Researchers have introduced FineBench, a new benchmark designed to evaluate the fine-grained human activity understanding capabilities of vision-language models (VLMs). The benchmark includes nearly 200,000 question-answer pairs across 64 long-form videos, focusing on detailed actions and interactions. Evaluations showed that while proprietary models like GPT-5 performed adequately, open-source VLMs struggled with spatial reasoning and subtle movement distinctions. To address these limitations, the team also proposed FineAgent, a framework that enhances VLMs using a localizer and descriptor, demonstrating improved performance on FineBench.
Impact: Establishes a new standard for evaluating VLMs' nuanced understanding of human activity, potentially driving development of more capable models.
Ranking rationale: The cluster describes a new academic paper introducing a benchmark for evaluating vision-language models and a framework for enhancing them.
AI-generated summary · Google Gemini · from 2 sources.