Brief · PulseAugur

TOOL · arXiv cs.CV English(EN) · 8h

GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction

Researchers have developed GRACE (Grounded Action-Centric Evidence), a framework for improving how Multimodal Large Language Models (MLLMs) predict viewer sentiment for video advertisements. GRACE addresses the limitations of current MLLMs by extracting structured, action-centric evidence: subject-verb-object triplets and localized visual crops of the entities taking part in each action. Grounding emotional cues in specific visual elements and their temporal order lets the MLLM reason about sentiment more precisely. In experiments on the Pitts dataset, GRACE significantly improved performance over baseline models such as Qwen2.5-VL and Qwen3-VL, and the approach was further validated on the AdsQA and TVQA datasets.
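To make the idea of "structured, action-centric evidence" concrete, here is a minimal Python sketch of how such evidence might be represented and turned into prompt text. It is an illustration only, not the paper's implementation: the `ActionEvidence` class, its field names, the `(x1, y1, x2, y2)` box convention, and the `format_evidence` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical convention: pixel box as (x1, y1, x2, y2).
BBox = Tuple[int, int, int, int]


@dataclass
class ActionEvidence:
    """One subject-verb-object triplet tied to a time span and entity crops."""
    subject: str
    verb: str
    obj: str
    start_s: float      # start of the action, in seconds
    end_s: float        # end of the action, in seconds
    subject_box: BBox   # where the subject would be cropped from the frame
    object_box: BBox    # where the object would be cropped from the frame


def format_evidence(items: List[ActionEvidence]) -> str:
    """Render evidence in temporal order as numbered lines of prompt text."""
    ordered = sorted(items, key=lambda e: e.start_s)
    lines = [
        f"[{i}] {e.start_s:.1f}s-{e.end_s:.1f}s: {e.subject} {e.verb} {e.obj}"
        for i, e in enumerate(ordered, start=1)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example values, not taken from any dataset.
    evidence = [
        ActionEvidence("child", "hugs", "dog", 4.0, 6.5,
                       (10, 20, 110, 220), (90, 120, 200, 240)),
        ActionEvidence("woman", "opens", "gift box", 0.5, 2.0,
                       (0, 0, 50, 50), (40, 40, 80, 80)),
    ]
    print(format_evidence(evidence))
```

In a setup like this, the numbered lines and the corresponding image crops would accompany the video when the MLLM is queried, so each emotional cue can be traced back to a specific action, entity, and moment.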

Hugging Face
arXiv
Qwen3-VL
MLLMs
DagsHub
Qwen2.5-VL
GRACE
Pitts dataset
AdsQA
TVQA