PulseAugur

New methods improve streaming video understanding with efficient memory and rewatching · Tracking 6 sources

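Researchers have developed new methods that improve streaming video understanding (SVU) under strict compute and memory constraints. ProtoKV is a novel memory system that aggregates older video content into summary states, raising accuracy by up to 12.5 percentage points in delayed-query scenarios. Separately, video-SALMONN-R$^3$ uses a rewatch mechanism to locate the relevant segments for more efficient question answering, outperforming its base model at lower computational cost. CausalMem offers a training-free way to build a dynamic, fixed-budget memory bank, achieving notable compression ratios and accuracy gains on MLLMs such as LLaVA-OneVision and Qwen2.5-VL.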
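
To make the idea of a fixed-budget memory that folds older content into summary states more concrete, here is a minimal Python sketch. It is an illustration only and is not the method of ProtoKV, CausalMem, or any paper listed here: the class name `FixedBudgetMemory`, the `Slot` type, the use of plain float lists as frame features, and the rule of merging the two most similar adjacent slots are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slot:
    """One memory entry: a mean feature vector and the number of frames it summarizes."""
    vector: List[float]
    count: int = 1


def squared_distance(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class FixedBudgetMemory:
    """Hypothetical memory bank that never holds more than `budget` slots.

    Assumption for this sketch: when the budget is exceeded, the two most
    similar temporally adjacent slots are merged into one count-weighted mean.
    """

    def __init__(self, budget: int) -> None:
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        self.slots: List[Slot] = []

    def add(self, feature: List[float]) -> None:
        """Append a new frame feature, then compress if over budget."""
        self.slots.append(Slot(list(feature)))
        if len(self.slots) > self.budget:
            self._merge_closest_adjacent_pair()

    def _merge_closest_adjacent_pair(self) -> None:
        # Index i of the adjacent pair (i, i + 1) with the smallest distance.
        i = min(
            range(len(self.slots) - 1),
            key=lambda k: squared_distance(
                self.slots[k].vector, self.slots[k + 1].vector
            ),
        )
        a, b = self.slots[i], self.slots[i + 1]
        total = a.count + b.count
        merged = [
            (x * a.count + y * b.count) / total
            for x, y in zip(a.vector, b.vector)
        ]
        # Replace the pair with a single summary slot, preserving temporal order.
        self.slots[i:i + 2] = [Slot(merged, total)]


memory = FixedBudgetMemory(budget=3)
for frame_feature in ([0.0], [0.1], [5.0], [5.2], [10.0]):
    memory.add(frame_feature)
```

After the five toy frames are added, the memory still holds three slots, and each slot's `count` records how many frames it summarizes. A real system would operate on key-value cache entries or visual token embeddings and would likely use a more careful merging criterion; this sketch only shows how memory can stay constant while the stream grows.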

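Impact: These advances in efficient video understanding could speed the development and deployment of AI systems that process and analyze live video streams with higher accuracy and lower computational overhead.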

Ranking rationale: Multiple research papers published on arXiv that detail novel approaches to streaming video understanding.

AI-generated summary · Google Gemini · based on 6 sources.

Sources [6]

  1. arXiv cs.LG TIER_1 English(EN) · Le Tu Ngoc Minh (KAIST), Jinyeong Lim (KAIST), Dongsu Han (KAIST) ·

    ProtoKV: Streaming Video Understanding under Delayed Query with Summary-State Memory

    arXiv:2606.26762v1 Announce Type: cross Abstract: Streaming video understanding (SVU) must answer queries that arrive asynchronously while visual tokens stream continuously under strict GPU-memory and query-time latency budgets. A key challenge is delayed query: decisive cues may…

  2. arXiv cs.LG TIER_1 English(EN) · Dongsu Han ·

    ProtoKV: Streaming Video Understanding under Delayed Query with Summary-State Memory

    Streaming video understanding (SVU) must answer queries that arrive asynchronously while visual tokens stream continuously under strict GPU-memory and query-time latency budgets. A key challenge is delayed query: decisive cues may appear briefly, yet many subsequent updates occur…

  3. arXiv cs.AI TIER_1 English(EN) · Yixuan Li, Guangzhi Sun, Yudong Yang, Wei Li, Zejun MA, Chao Zhang ·

    video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding

    arXiv:2606.24477v1 Announce Type: cross Abstract: Video large language models (LLMs) are often constrained by computation and memory budgets, leading them to use reduced frame rates and spatial resolutions, which may cause them to miss critical information for question answering …

  4. arXiv cs.AI TIER_1 English(EN) · Chao Zhang ·

    video-SALMONN-R$^3$: Learning to ReWatch, ReAsk, and ReAnswer for Efficient Video Understanding

    Video large language models (LLMs) are often constrained by computation and memory budgets, leading them to use reduced frame rates and spatial resolutions, which may cause them to miss critical information for question answering (QA). A practical and efficient solution is a two-…

  5. arXiv cs.CV TIER_1 English(EN) · Baiyang Song, Yuli Lin, Qiong Wu, Tao Chen, Jun Peng, Xiao Chen, Yiyi Zhou, Rongrong Ji ·

    Towards a Dynamic and Fixed-budget Memory Bank for Efficient Streaming Video Understanding

    arXiv:2606.25658v1 Announce Type: new Abstract: Currently, streaming video understanding is still a daunting task for existing multimodal large language models (MLLMs). Its difficulties not only lie in handling the ever-increasing video frames, but also in the unpredictabi…

  6. arXiv cs.CV TIER_1 English(EN) · Rongrong Ji ·

    Towards a Dynamic and Fixed-budget Memory Bank for Efficient Streaming Video Understanding

    Currently, streaming video understanding is still a daunting task for existing multimodal large language models (MLLMs). Its difficulties not only lie in handling the ever-increasing video frames, but also in the unpredictability of future video content and input instructi…