E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs
Researchers have introduced E-VAds, a new benchmark designed to evaluate the understanding capabilities of multimodal large language models (MLLMs) specifically for e-commerce short videos. This benchmark addresses the limitations of existing datasets by focusing on the unique characteristics of commercial content, which exhibits higher density in visual, audio, and textual signals. E-VAds includes over 3,900 videos and nearly 20,000 question-answer pairs categorized into perception, cognition, and reasoning tasks. The paper also details E-VAds-R1, a novel reasoning model that demonstrates significant performance gains in identifying commercial intent.
AI IMPACT: This benchmark could drive MLLM development towards better understanding and generation of commercially-oriented content.