Researchers have introduced GRASP, a new dataset and benchmark designed to improve the ability of multimodal large language models (MLLMs) to understand social interactions in videos. GRASP links high-level social question answering with detailed analysis of gaze and deictic (pointing) gestures across nearly 50,000 videos. The dataset also comes with a novel learning signal, Social Grounding Reward (SGR), which uses these fine-grained events to train models to better identify the participants in a social interaction.
IMPACT: Enhances AI's ability to interpret complex social dynamics in videos, which could improve applications in human-computer interaction and video analysis.
AI-generated summary · Google Gemini · from 1 source.