PulseAugur
Native Active Perception as Reasoning for Omni-Modal Understanding

OmniAgent uses active perception for efficient video understanding · 2 sources tracked

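Researchers have introduced OmniAgent, a novel omni-modal agent for video understanding. It runs an iterative Observation-Thought-Action cycle formulated as a partially observable Markov decision process (POMDP). This design lets the agent selectively distill audio-visual cues into a textual memory, which decouples reasoning complexity from raw video length and improves computational efficiency. The paper details two key training methods: Agentic Supervised Fine-Tuning, which bootstraps active perception, and Agentic Reinforcement Learning with TAURA, which refines credit assignment. OmniAgent reports state-of-the-art results on benchmarks such as LVBench, outperforming larger models such as Qwen2.5-VL-72B.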
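The loop summarized above can be pictured with a toy sketch. Everything in it is an assumption for illustration, not OmniAgent's implementation: the "video" is a list of per-second text captions, the action space is just `observe` and `answer`, and `distill`, `run_agent`, and `strided_probe_policy` are hypothetical names, with a keyword filter and a fixed heuristic standing in for the trained model. It only shows the structural idea: the policy reads the query and a textual memory rather than the raw video, and perception cost is capped by the step budget times the window size instead of growing with video length.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Toy stand-in for a long video: one text caption per second.
# A real omni-modal agent would perceive audio-visual clips (assumption).
Video = List[str]


@dataclass
class Action:
    kind: str        # hypothetical action space: "observe" or "answer"
    start: int = 0   # first second to inspect (for "observe")
    end: int = 0     # one past the last second to inspect
    answer: str = ""


@dataclass
class AgentState:
    memory: List[str] = field(default_factory=list)  # distilled textual memory
    seconds_seen: int = 0                            # perception cost so far


# A policy maps (query, memory, step, video length) to the next action.
Policy = Callable[[str, List[str], int, int], Action]


def distill(segment: List[str], query: str) -> str:
    """Toy distillation: keep only captions sharing a word with the query."""
    keywords = set(query.lower().split())
    hits = [c for c in segment if keywords & set(c.lower().split())]
    return " | ".join(hits)


def run_agent(video: Video, query: str, policy: Policy,
              max_steps: int = 8) -> Tuple[str, AgentState]:
    """Observation-thought-action loop with a fixed step budget."""
    state = AgentState()
    for step in range(max_steps):
        # Thought -> Action: the policy sees only the query and text memory.
        action = policy(query, state.memory, step, len(video))
        if action.kind == "answer":
            return action.answer, state
        # Observation: only the requested window is perceived.
        segment = video[action.start:action.end]
        state.seconds_seen += len(segment)
        note = distill(segment, query)
        if note:
            state.memory.append(note)
    return "", state  # budget exhausted without an answer


def strided_probe_policy(query: str, memory: List[str],
                         step: int, video_len: int) -> Action:
    """Hypothetical heuristic: answer once evidence exists, else probe ahead."""
    if memory:
        return Action("answer", answer=memory[-1])
    start = step * video_len // 4  # jump a quarter of the video per step
    return Action("observe", start=start, end=start + 16)


video = ["background noise"] * 1000
video[510] = "a red car enters"
answer, state = run_agent(video, "red car", strided_probe_policy)
print(answer, state.seconds_seen)  # a red car enters 48
```

In this toy run the agent inspects 48 of 1,000 seconds before answering. A trained policy would choose windows adaptively rather than by a fixed stride; the fixed heuristic is only there to keep the sketch self-contained.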

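Impact: By processing information selectively, it offers a more efficient approach to video understanding that could lower the computational cost of analyzing long-form content.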

Ranking rationale: This cluster contains an academic paper detailing a new model and method.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

  1. arXiv cs.CL · TIER_1 · English (EN) · Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

    Native Active Perception as Reasoning for Omni-Modal Understanding

    arXiv:2606.19341v1 Announce Type: cross Abstract: Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive fram…

  2. Hugging Face Daily Papers · TIER_1 · English (EN)

    Native Active Perception as Reasoning for Omni-Modal Understanding

    OmniAgent is a novel omni-modal agent that addresses long video understanding by using an iterative observation-thought-action cycle with active perception, achieving superior performance compared to larger models through efficient selective processing.

  3. arXiv cs.CV · TIER_1 · English (EN) · Pheng-Ann Heng

    Native Active Perception as Reasoning for Omni-Modal Understanding

    Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre…