New FineBench benchmark highlights VLM struggles with fine-grained human activity understanding

By PulseAugur Editorial Team · [2 sources] · 2026-05-19 13:40

Researchers have introduced FineBench, a benchmark for evaluating how well vision-language models (VLMs) understand fine-grained human activity. It comprises nearly 200,000 question-answer pairs across 64 long-form videos and focuses on detailed actions and interactions. In the evaluations, proprietary models such as GPT-5 performed adequately, while open-source VLMs struggled with spatial reasoning and subtle movement distinctions. To address these limitations, the team also proposed FineAgent, a framework that augments VLMs with a localizer and a descriptor and reports improved performance on FineBench.

Impact: Establishes a new standard for evaluating VLMs' nuanced understanding of human activity, potentially driving the development of more capable models.

Ranking rationale: The cluster describes a new academic paper introducing a benchmark and a framework for evaluating and enhancing vision-language models.

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu · 2026-05-22 04:00

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

arXiv:2605.19846v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced inte…
arXiv cs.AI TIER_1 English(EN) · Winston H. Hsu · 2026-05-19 13:40

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some r…
