Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 7h

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

Researchers have developed SMART, a framework for video moment retrieval that integrates audio cues with visual information. It builds on a Multimodal Large Language Model (MLLM) and uses a "Shot-aware Token Compression" technique that selectively retains important information within each video shot, preserving fine-grained temporal detail. On the standard benchmarks Charades-STA and QVHighlights, the authors report significant improvements over existing state-of-the-art methods.
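The brief does not describe how Shot-aware Token Compression works internally. As a rough illustration of the general idea only, the sketch below keeps the highest-scoring tokens inside each shot instead of across the whole video, so every shot retains some representation. The function name `compress_tokens_by_shot`, the per-token importance scores, the shot boundaries, and the fixed `keep_per_shot` budget are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Sequence


def compress_tokens_by_shot(
    scores: Sequence[float],
    shot_starts: Sequence[int],
    keep_per_shot: int,
) -> List[int]:
    """Return indices of tokens kept after per-shot top-k selection.

    Hypothetical helper: `scores` holds one importance score per token,
    and `shot_starts` holds the index of the first token of each shot.
    """
    if keep_per_shot < 1:
        raise ValueError("keep_per_shot must be at least 1")
    kept: List[int] = []
    # Pair each shot start with the next start (or the end of the sequence).
    ends = list(shot_starts[1:]) + [len(scores)]
    for start, end in zip(shot_starts, ends):
        shot_indices = range(start, end)
        # Highest-scoring tokens in this shot; ties go to the earlier token.
        top = sorted(shot_indices, key=lambda i: (-scores[i], i))[:keep_per_shot]
        # Restore temporal order so the kept tokens stay aligned with time.
        kept.extend(sorted(top))
    return kept


# Made-up scores for eight tokens spread over three shots.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.6, 0.05]
shot_starts = [0, 3, 6]  # shots cover tokens 0-2, 3-5, and 6-7
kept = compress_tokens_by_shot(scores, shot_starts, keep_per_shot=2)
print(kept)  # [1, 2, 3, 5, 6, 7]
```

Selecting per shot rather than globally means a short, low-scoring shot is never dropped entirely, which is one plausible way to preserve temporal coverage; the actual method may score and select tokens quite differently.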

IMPACT Improves video understanding capabilities, potentially enhancing applications like video search and content analysis.

SMART
Multimodal Large Language Model
Charades-STA
QVHighlights
An Yu