MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue
Researchers have introduced MM-Conv, a new benchmark designed to improve how AI systems understand and ground language in dynamic 3D environments during conversations. The benchmark is built on egocentric VR interaction data comprising 6.7 hours of synchronized speech, motion, gaze, and 3D scene geometry. The authors also propose a two-stage grounding pipeline that first resolves conversational ambiguity and then performs visual localization, which is reported to yield significant performance gains.
AI IMPACT: Enhances AI's ability to understand and act upon conversational references in complex, dynamic 3D environments.