Brief · PulseAugur

RESEARCH · arXiv cs.CV English(EN) · 4d · [2 sources]

Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning

Researchers have developed AVEX-Prune, a reinforcement learning-based method for pruning tokens in audio-visual large language models. It uses an audio-visual token exchange strategy to identify and retain the most valuable tokens, including those near decision boundaries. AVEX-Prune maintains high captioning quality while reducing the token count by 60%, and performs strongly on models such as VILA 1.5-8B and VideoLLaMA 2.
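To make the 60% figure concrete, here is a minimal sketch of generic score-based token pruning. It is not AVEX-Prune itself: the brief does not describe the reinforcement learning policy or the exchange strategy in enough detail to reproduce them, so the importance scores below are random placeholders, and the `prune_tokens` function and its `keep_ratio` parameter are hypothetical names chosen for illustration.

```python
import random


def prune_tokens(scores, keep_ratio=0.4):
    """Return the indices of the highest-scoring tokens, in original order.

    keep_ratio=0.4 corresponds to the 60% token reduction quoted in the brief.
    This is plain top-k selection, an assumed stand-in for the learned policy.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by score, highest first, and keep the top n_keep.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original order so the kept tokens stay in sequence.
    return sorted(ranked[:n_keep])


rng = random.Random(0)
# Placeholder importance scores for 20 audio-visual tokens (hypothetical data).
scores = [rng.random() for _ in range(20)]
kept = prune_tokens(scores)
print(len(kept), "of", len(scores), "tokens kept")
```

With 20 input tokens and a keep ratio of 0.4, 8 tokens remain. In a real system the scores would come from the model or a learned pruning policy rather than a random generator.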

IMPACT Reduces computational load for audio-visual LLMs, potentially enabling faster and more efficient captioning.

AVEX-Prune
VILA 1.5-8B
VideoLLaMA 2