PulseAugur
EN
LIVE 11:31:26

New SMART framework enhances video moment retrieval with audio and shot-aware compression

Researchers have developed SMART, a new framework for video moment retrieval that enhances multimodal understanding by integrating audio cues with visual information. This approach utilizes a Multimodal Large Language Model (MLLM) and employs a novel "Shot-aware Token Compression" technique to selectively retain important information within each video shot, thereby preserving fine-grained temporal details. Evaluations on standard benchmarks like Charades-STA and QVHighlights demonstrated SMART's effectiveness, showing significant improvements over existing state-of-the-art methods. AI

IMPACT Improves video understanding capabilities, potentially enhancing applications like video search and content analysis.

RANK_REASON The cluster contains an academic paper detailing a new method and benchmark results. [lever_c_demoted from research: ic=1 ai=1.0]

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 1 sources. How we write summaries →

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · An Yu, Weiheng Lu, Jian Li, Zhenfei Zhang, Yunhang Shen, Felix X. -F. Ye, Ming-Ching Chang ·

    SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

    arXiv:2511.14143v2 Announce Type: replace-cross Abstract: Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos usi…