Brief · PulseAugur

TOOL · arXiv cs.AI · 7h

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

Researchers have developed OmniMem, a framework that makes audio-visual large language models more memory-efficient when processing long videos. It targets the linear growth of video tokens and KV caches with a modality-aware allocation strategy that distinguishes between visual and audio contexts. A perturbation-aware selection step retains crucial information so that memory compression does not degrade understanding. In experiments, OmniMem improves accuracy by 2-4% over existing methods under similar memory constraints, with further gains possible through budget-aware fine-tuning.

IMPACT Improves the memory efficiency of audio-visual LLMs, which could enable more sophisticated analysis and understanding of long-form video.

LLMs
arXiv
video
OmniMem
video-SALMONN 2+
Qwen-2.5-Omni