MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
Researchers have developed MAVEN, an agentic pipeline that automates the creation of high-quality structured annotations for video reasoning tasks. The pipeline synthesizes multi-scale event descriptions and supports agent-driven domain adaptation, redesigning its own prompts and pipeline structure without manual intervention. MAVEN was used to label over 5,300 traffic videos, and Cosmos-Reason2-8B fine-tuned on this data outperformed Gemini 2.5 Pro and 3.1 Flash on specific evaluation sets.
AI IMPACT: Automates video data annotation, potentially accelerating VLM training and improving performance on complex reasoning tasks.