SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
Researchers have developed SMART, a framework for video moment retrieval that integrates audio cues with visual information. It builds on a Multimodal Large Language Model (MLLM) and introduces a "Shot-aware Token Compression" technique that selectively retains important information within each video shot, preserving fine-grained temporal details. Evaluations on the standard benchmarks Charades-STA and QVHighlights showed significant improvements over existing state-of-the-art methods.
AI IMPACT: Improves video understanding capabilities, potentially enhancing applications such as video search and content analysis.
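
The idea of retaining important information within each shot, rather than across the whole video, can be illustrated with a minimal Python sketch. This is not the SMART authors' implementation: the function name `compress_tokens_by_shot`, the per-token importance scores, the shot boundaries, and the `keep_ratio` parameter are all hypothetical and chosen only for illustration. The sketch assumes each token already has a scalar importance score and that shot boundaries are known.

```python
import math
from typing import List, Sequence, Tuple


def compress_tokens_by_shot(
    scores: Sequence[float],
    shot_bounds: Sequence[Tuple[int, int]],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return sorted indices of the tokens kept after per-shot top-k selection.

    scores: one hypothetical importance score per token, in temporal order.
    shot_bounds: half-open (start, end) token index ranges, one per shot.
    keep_ratio: fraction of tokens retained inside each shot (assumed knob).
    """
    kept: List[int] = []
    for start, end in shot_bounds:
        n = end - start
        if n <= 0:
            continue  # skip empty shots
        # Keep at least one token per shot so no shot disappears entirely.
        k = max(1, math.ceil(n * keep_ratio))
        ranked = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:k])
    # Restore temporal order for the downstream model.
    return sorted(kept)


# Hypothetical example: two shots of four tokens each.
scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.07, 0.06, 0.04]
shots = [(0, 4), (4, 8)]
print(compress_tokens_by_shot(scores, shots, keep_ratio=0.5))  # [0, 2, 5, 6]
```

In this toy input, a global top-4 selection would keep only tokens from the first shot (indices 0 to 3), discarding the second shot entirely. Selecting within each shot keeps tokens from both, which is one plausible reading of how a shot-aware scheme preserves temporal detail.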