PulseAugur
English(EN) VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

New AI Frameworks Advance Video Editing and Understanding

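Researchers have introduced several new frameworks and benchmarks to advance AI models' capabilities in video understanding and editing. Aurora uses an agentic framework in which a tool-augmented vision-language model parses raw user video editing requests and maps them to structured editing plans for a diffusion transformer. OmniPro provides a comprehensive benchmark for omni-proactive streaming video understanding, evaluating a model's ability to autonomously decide when to speak and what to say from audio-visual streams, with a focus on the role of audio and long-horizon robustness. R3-Streaming proposes an efficient streaming video understanding framework that dynamically compresses memory and routes computation according to query complexity, achieving state-of-the-art results with significantly fewer tokens. VideoSeeker introduces an instance-level video understanding paradigm that uses visual prompts and agentic tool invocation, outperforming models such as GPT-4o and Gemini-2.5-Pro on specific tasks.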

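Impact: These advances push the boundaries of AI in video processing, enabling more sophisticated editing tools and robust real-time understanding of dynamic audio-visual content.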

Ranking rationale: Multiple research papers introduce new frameworks and benchmarks for AI video understanding and editing.

Read on arXiv cs.CV →

AI-generated summary · Google Gemini · from 5 sources. How we write summaries →

Sources [5]

  1. arXiv cs.CV TIER_1 English(EN) · Renjie Liao ·

    StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

    Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with …

  2. arXiv cs.CV TIER_1 English(EN) · Jiebo Luo ·

    Aurora: Unified Video Editing with a Tool-Using Agent

    Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is…

  3. arXiv cs.CV TIER_1 English(EN) · Xirong Li ·

    OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

    Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on…

  4. arXiv cs.CV TIER_1 English(EN) · Xin Jin ·

    An Efficient Streaming Video Understanding Framework with Agentic Control

    Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while al…

  5. arXiv cs.CV TIER_1 English(EN) · Feng Zhao ·

    VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

    Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interact…