New AI Frameworks Advance Video Editing and Understanding

By the PulseAugur Editorial Team · [5 sources] · 2026-05-15 15:43

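Researchers have introduced several new frameworks and benchmarks to advance AI models' capabilities in video understanding and editing. Aurora uses an agentic framework in which a tool-augmented vision-language model parses raw user video editing requests and maps them to structured editing plans for a diffusion transformer. OmniPro provides a comprehensive benchmark for omni-proactive streaming video understanding, evaluating a model's ability to autonomously decide when to speak and what to say from audio-visual streams, with a focus on the role of audio and long-horizon robustness. R3-Streaming proposes an efficient streaming video understanding framework that dynamically compresses memory and routes computation according to query complexity, achieving state-of-the-art results with significantly fewer tokens. VideoSeeker introduces an instance-level video understanding paradigm that uses visual prompts and agentic tool invocation, outperforming models such as GPT-4o and Gemini-2.5-Pro on specific tasks.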

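Impact: These advances push the boundaries of AI in video processing, enabling more sophisticated editing tools and robust real-time understanding of dynamic audio-visual content.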

Ranking rationale: Multiple research papers introduce new frameworks and benchmarks for AI video understanding and editing.

AI-generated summary · Google Gemini · from 5 sources.

Sources [5]

arXiv cs.CV TIER_1 English(EN) · Renjie Liao · 2026-05-20 17:52

StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with …
arXiv cs.CV TIER_1 English(EN) · Jiebo Luo · 2026-05-18 17:59

Aurora: A Unified Video Editing Agent with Tool Use

Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is…
arXiv cs.CV TIER_1 English(EN) · Xirong Li · 2026-05-18 15:55

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on…
arXiv cs.CV TIER_1 English(EN) · Xin Jin · 2026-05-18 06:29

An Efficient Streaming Video Understanding Framework with Agentic Control

Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while al…
arXiv cs.CV TIER_1 English(EN) · Feng Zhao · 2026-05-15 15:43

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interact…
