HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

By PulseAugur Editorial · [5 sources] · 2026-04-23 09:04

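Researchers have developed new frameworks to improve AI models' video understanding and reasoning. StoryTR introduces a benchmark and training method centered on Theory of Mind for inferring narrative causality, and indicates that reasoning ability matters more than model scale. HiCrew uses a hierarchical multi-agent approach that handles long videos through question-aware collaboration, preserving temporal coherence and adapting its reasoning strategy. UpstreamQA proposes a modular framework that decouples reasoning components, using large reasoning models to enrich the inputs to downstream video question answering models and thereby improve performance and interpretability. Find, Fix, Reason introduces a context repair method in which a teacher model guides a student model by supplying missing spatiotemporal dependencies, improving the accuracy and generalization of video reasoning.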

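Impact: Advances in video reasoning frameworks could lead to more sophisticated AI agents able to understand complex narratives and causal relationships in visual data.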

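Ranking rationale: This cluster contains multiple academic papers introducing new models, benchmarks, and frameworks for video understanding and reasoning.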


AI-generated summary · Google Gemini · from 5 sources.


Sources [5]

arXiv cs.AI TIER_1 English(EN) · Xuanyue Zhong, Yuqiang Xie, Guanqun Bi, Jiangping Yang, Guibin Chen · 2026-04-28 04:00

StoryTR: Narrative-Oriented Video Temporal Retrieval Based on Theory-of-Mind Reasoning

arXiv:2604.23198v1 · Announce type: new · Abstract: Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see what is happening but fail to reason why it matters. This semantic gap stems from the lack of \text…

arXiv cs.AI TIER_1 English(EN) · Baoquan Zhao · 2026-04-23 09:04

HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

Long-form video understanding remains fundamentally challenged by pervasive spatiotemporal redundancy and intricate narrative dependencies that span extended temporal horizons. While recent structured representations compress visual information effectively, they frequently sacrif…

arXiv cs.CV TIER_1 English(EN) · Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue · 2026-04-29 04:00

OneThinker: All-in-One Reasoning Model for Image and Video

arXiv:2512.03043v3 · Announce type: replace · Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks…

arXiv cs.CV TIER_1 English(EN) · Jason Nguyen, Ameet Rao, Alexander Chang, Ishaan Kumar, Erin Tan · 2026-04-28 04:00

UpstreamQA: A Modular Framework for Explicit Reasoning in Video Question Answering Tasks

arXiv:2604.23145v1 · Announce type: new · Abstract: Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task's inherent complexity often requires multi-step reasoning that current large multimodal models (LMM…

arXiv cs.CV TIER_1 English(EN) · Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen · 2026-04-28 04:00

Find, Fix, Reason: Context Repair for Video Reasoning

arXiv:2604.16243v2 · Announce type: replace · Abstract: Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model's knowledge boundary, or hybrid replay that mixes pol…
