PulseAugur

New FineBench benchmark highlights VLM struggles with human activity

Researchers have introduced FineBench, a benchmark for evaluating how well vision-language models (VLMs) understand fine-grained human activity. It contains nearly 200,000 question-answer pairs across 64 long-form videos, focusing on detailed actions and interactions. In the evaluations, proprietary models such as GPT-5 performed adequately, while open-source VLMs struggled with spatial reasoning and with distinguishing subtle movements. To address these limitations, the team also proposed FineAgent, a framework that augments VLMs with a localizer and a descriptor and improves their performance on FineBench.

Impact: Establishes a new standard for evaluating VLMs' nuanced understanding of human activity, potentially driving the development of more capable models.

Ranking rationale: The cluster describes a new academic paper introducing a benchmark and a framework for evaluating and enhancing vision-language models.


AI-generated summary · Google Gemini · from 2 sources.


Sources [2]

  1. arXiv cs.AI · TIER_1 · English (EN) · Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

    FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

    arXiv:2605.19846v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced inte…

  2. arXiv cs.AI · TIER_1 · English (EN) · Winston H. Hsu

    FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

    Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some r…