EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
Researchers have introduced EvoVid, a framework that improves Video Large Language Models (Video-LLMs) through temporal-centric self-evolution. Unlike previous self-evolving methods, which are limited to static data, EvoVid lets Video-LLMs learn directly from raw, unannotated videos by focusing on temporal dynamics. The framework uses specialized rewards for question generation and video segment localization, and it yields consistent performance improvements across multiple benchmarks and base models.
IMPACT: Enables Video-LLMs to improve directly from unannotated videos, potentially reducing reliance on costly human supervision.