New SMART framework enhances video moment retrieval with audio and shot-aware compression

By PulseAugur Editorial · [1 source] · 2026-06-09 04:00

Researchers have developed SMART, a framework for video moment retrieval that integrates audio cues with visual information. It is built on a Multimodal Large Language Model (MLLM) and uses a "Shot-aware Token Compression" technique that selectively retains important information within each video shot, preserving fine-grained temporal detail. On the Charades-STA and QVHighlights benchmarks, the authors report significant improvements over existing state-of-the-art methods.
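
The idea of keeping only the most informative tokens inside each shot can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `compress_tokens_by_shot`, the per-token `scores`, the half-open `shot_boundaries`, and the `keep_ratio` parameter are all hypothetical, and how SMART actually scores tokens or detects shots is not described in this summary.

```python
from typing import List, Sequence, Tuple


def compress_tokens_by_shot(
    scores: Sequence[float],
    shot_boundaries: Sequence[Tuple[int, int]],
    keep_ratio: float = 0.5,
) -> List[int]:
    """Return indices of retained tokens, in original temporal order.

    Assumptions (hypothetical, for illustration only):
    - scores[i] is some importance score for token i;
    - each shot is a half-open index range [start, end);
    - a fixed fraction of tokens is kept in every shot.
    """
    kept: List[int] = []
    for start, end in shot_boundaries:
        indices = list(range(start, end))
        if not indices:
            continue
        # Keep at least one token per shot so no shot disappears entirely.
        k = max(1, int(len(indices) * keep_ratio))
        # Pick the k highest-scoring tokens within this shot only.
        top = sorted(indices, key=lambda i: scores[i], reverse=True)[:k]
        # Restore temporal order inside the shot.
        kept.extend(sorted(top))
    return kept


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
    shots = [(0, 4), (4, 6)]
    print(compress_tokens_by_shot(scores, shots, keep_ratio=0.5))
```

Selecting per shot rather than over the whole video means every shot contributes at least one token, which is one plausible way to keep temporal coverage while shrinking the token count.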

IMPACT Improves video understanding capabilities, potentially enhancing applications like video search and content analysis.

RANK_REASON The cluster contains an academic paper detailing a new method and benchmark results.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · An Yu, Weiheng Lu, Jian Li, Zhenfei Zhang, Yunhang Shen, Felix X.-F. Ye, Ming-Ching Chang · 2026-06-09 04:00

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

arXiv:2511.14143v2 Announce Type: replace-cross Abstract: Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos usi…
