PulseAugur
EN
LIVE 10:47:27

VideoNet dataset challenges vision-language models on domain-specific action recognition

Researchers have introduced VideoNet, a large-scale dataset designed to improve domain-specific action recognition in videos. The benchmark, covering 1,000 actions across 37 domains, highlights current limitations in vision-language models (VLMs) like Gemini 3.1 Pro and Qwen3-VL-8B, which struggle with accuracy and few-shot learning on these tasks. To address this, a new training dataset of nearly 500,000 video question-answer pairs was created, enabling a fine-tuned Molmo2-4B model to outperform existing open-weight 8B models on VideoNet. AI

IMPACT Revitalizes action recognition research, potentially improving VLM capabilities in specialized video understanding tasks.

RANK_REASON The cluster contains a new academic paper introducing a dataset and benchmark for action recognition.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 2 sources. How we write summaries →

VideoNet dataset challenges vision-language models on domain-specific action recognition

COVERAGE [2]

  1. arXiv cs.CV TIER_1 English(EN) · Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna ·

    VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

    arXiv:2605.02834v1 Announce Type: new Abstract: Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently …

  2. arXiv cs.CV TIER_1 English(EN) · Ranjay Krishna ·

    VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

    Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-lang…