New Architectures Enable Real-Time Video Understanding

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 13:24

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to model this environment is scene graphs as an interpretable representation of OR int…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 14:19

Q-Fold: Query-Aware Focus-Context Spatiotemporal Folding for Long-Video Understanding

Long-video understanding remains challenging for multimodal large language models, because temporally extended videos often contain thousands of frames and are therefore expensive to process exhaustively. Existing methods usually construct compact visual inputs from long videos u…

arXiv cs.AI TIER_1 English(EN) · Shuning Wang, Zhiheng Wu, YiNuo Lu, Naiming Liu, Chen Jia, Bowen Liu, Shuo Nie, Weijie Zhu, Yumeng Zhang · 2026-06-09 04:00

See Wider, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue-Guided Reflection for Long-Video Understanding

arXiv:2606.09064v1 Announce Type: cross Abstract: Recent advances in Video Large Language Models (Video-LLMs) have enabled performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search…

arXiv cs.AI TIER_1 English(EN) · Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai "Helen" Li, Yiran Chen · 2026-06-09 04:00

When No Answer Is Correct: Diagnosing Absent Answer Detection of MLLMs in Video Understanding

arXiv:2606.08239v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for M…

arXiv cs.AI TIER_1 English(EN) · Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Bot… · 2026-06-09 04:00

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

arXiv:2606.07639v1 Announce Type: cross Abstract: Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still r…

arXiv cs.AI TIER_1 English(EN) · Lei Wang, Syuan-Hao Li, Piotr Koniusz, Yongsheng Gao · 2026-06-09 04:00

Video Understanding by Design: How Datasets Shape Video Models

arXiv:2509.09151v2 Announce Type: replace-cross Abstract: Research in video understanding has advanced rapidly, driven by increasingly diverse datasets and more powerful model architectures. While existing surveys typically organize progress by tasks, benchmarks, or model familie…

arXiv cs.AI TIER_1 English(EN) · Jiaxin Dai, Zehang Wei, Jiamin Yan, Xiang Xiang · 2026-06-09 04:00

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Video Retrieval-Augmented Generation Pipeline

arXiv:2606.07924v1 Announce Type: cross Abstract: This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adh…

arXiv cs.AI TIER_1 English(EN) · Zhenyu Yang, Kairui Zhang, Shengsheng Qian, Weiming Dong, Changsheng Xu · 2026-06-08 04:00

Don't Pause: Streaming Video-Language Synchronization for Online Video Understanding

arXiv:2606.06991v1 Announce Type: cross Abstract: Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing …

arXiv cs.AI TIER_1 English(EN) · Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang · 2026-06-08 04:00

Watch, Remember, Reason: Human-Perspective Video Understanding with MLLMs

arXiv:2606.07433v1 Announce Type: cross Abstract: Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handl…

arXiv cs.AI TIER_1 English(EN) · Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen · 2026-06-08 04:00

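MemDreamer: Decoupling Perception and Reasoning via Hierarchical Graph Memory and an Agentic Retrieval Mechanism for Long-Video Understanding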

arXiv:2606.07512v1 Announce Type: cross Abstract: Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perce…

arXiv cs.AI TIER_1 English(EN) · Yiran Chen · 2026-06-06 15:51

When No Answer Is Correct: Diagnosing Absent Answer Detection of MLLMs in Video Understanding

Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct a…

arXiv cs.CL TIER_1 English(EN) · Xiang Xiang · 2026-06-06 01:17

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Video Retrieval-Augmented Generation Pipeline

This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding,…

arXiv cs.AI TIER_1 English(EN) · Chunhua Shen · 2026-06-05 17:59

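MemDreamer: Decoupling Perception and Reasoning via Hierarchical Graph Memory and an Agentic Retrieval Mechanism for Long-Video Understanding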

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understan…

arXiv cs.CL TIER_1 English(EN) · Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang · 2026-06-05 04:00

LongSpace: Exploring Long-Horizon Spatiotemporal Memory in Video, from Perception to Recall

arXiv:2606.05677v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recogniz…

arXiv cs.CL TIER_1 English(EN) · Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles · 2026-06-05 04:00

Active Video Perception: Iterative Evidence Seeking for Agentic Long-Video Understanding

arXiv:2512.05774v2 Announce Type: replace-cross Abstract: Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines impr…

arXiv cs.CL TIER_1 English(EN) · Kejuan Yang, Yizhuo Zhang, Mingyuan Du, Yue Zhang, Dixin Zheng, Kaili Zhao, Yang Xiao, Hanzhong Liang, Kenan Xiao · 2026-06-05 04:00

UNIVID: A Unified Vision-Language Model for Video Moderation

arXiv:2606.05748v1 Announce Type: cross Abstract: Global-scale video moderation faces a dual challenge: the need for fine-grained multi-modal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmen…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-05 00:00

Watch, Remember, Reason: Human-Perspective Video Understanding with MLLMs

Multimodal large language models for video understanding are structured around three core capabilities—watching, remembering, and reasoning—with applications spanning multiple video domains and addressing challenges in perception, memory, and reasoning.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-05 00:00

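MemDreamer: Decoupling Perception and Reasoning via Hierarchical Graph Memory and an Agentic Retrieval Mechanism for Long-Video Understanding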

MemDreamer addresses long-video understanding challenges by decoupling perception and reasoning through hierarchical graph memory and agentic exploration, achieving state-of-the-art performance with reduced computational overhead.

arXiv cs.AI TIER_1 English(EN) · Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong · 2026-06-04 04:00

M$^3$Eval: Multimodal Memory Evaluation via Cognitively Grounded Video Tasks

arXiv:2606.05008v1 Announce Type: cross Abstract: As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception an…

arXiv cs.CL TIER_1 English(EN) · Huangchen Xu, Yuan Wu, Yi Chang · 2026-06-04 04:00

VCIFBench: Evaluating Complex Instruction Following in Video Understanding

arXiv:2606.04588v1 Announce Type: new Abstract: Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We i…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

Imagine Before Predicting: Interleaved Latent Visual Reasoning for Video Event Prediction

Future-L1, an interleaved latent visual reasoning framework, improves video event prediction by maintaining visual semantics in latent space during autoregressive decoding, achieving state-of-the-art results on FutureBench and TwiFF-Bench benchmarks.

arXiv cs.CL TIER_1 English(EN) · Yiwu Zhong · 2026-06-03 15:28

M$^3$Eval: Multimodal Memory Evaluation via Cognitively Grounded Video Tasks

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating mem…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 08:27

VCIFBench: Evaluating Complex Instruction Following in Video Understanding

Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating c…

arXiv cs.CL TIER_1 English(EN) · Yi Chang · 2026-06-03 08:27

VCIFBench: Evaluating Complex Instruction Following in Video Understanding

Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating c…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Kun Gai · 2026-06-03 04:49

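The Convergence of Short Video and Live Streaming: Reasoning-Guided Multimodal Large Models for Cross-Domain Representation Learning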

As live streaming services grow, many platforms offer short videos and live streams to meet diverse needs. Short videos carry substantial traffic and rich behavior signals, whereas live streaming is a core conversion scenario with sparse behavior data, making cold start severe. T…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 00:00

VideoKR: Toward Knowledge- and Reasoning-Intensive Video Understanding

VideoKR presents a large-scale video reasoning dataset and benchmark designed to enhance knowledge-intensive video understanding through expert-domain content and human-in-the-loop example generation.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 00:00

M$^3$Eval: Multimodal Memory Evaluation via Cognitively Grounded Video Tasks

Multi-modal models exhibit significant limitations in memory capabilities, particularly in maintaining disentangled representations and demonstrating human-like interference patterns, highlighting the need for improved memory mechanisms in video understanding systems.

arXiv cs.LG TIER_1 English(EN) · Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu · 2026-06-02 04:00

Toward Sparse Video Understanding and Reasoning

arXiv:2602.13602v2 Announce Type: replace-cross Abstract: We present ReViSe (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, ReViSe selects a small set of informati…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

Benchmarking Visual State Tracking in Multimodal Video Understanding

Current multimodal large language models struggle with visual state tracking in videos, performing poorly even when human-level capabilities are required, and existing agentic approaches do not effectively address these limitations.

arXiv cs.AI TIER_1 English(EN) · Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv, Hui Li · 2026-05-29 04:00

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution to the HD-EPIC VQA Challenge

arXiv:2605.29402v1 Announce Type: cross Abstract: Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benc…

arXiv cs.AI TIER_1 English(EN) · Peng Zhang, Guanghao Zhang, Wanggui He, Longxiang Zhang, Mushui Liu, Yan Xia, Zhenhao Peng, Weilong Dai, Jinlong Liu, Haobing Tang, Le Zhang, Hao Jiang, Pipei Huang · 2026-05-27 04:00

DynFrame: An Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Enhancement for Complex Video Understanding

arXiv:2605.26680v1 Announce Type: cross Abstract: Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structura…

arXiv cs.AI TIER_1 English(EN) · Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang · 2026-05-27 04:00

Demystifying Video Reasoning

arXiv:2603.16870v2 Announce Type: replace-cross Abstract: Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where re…

arXiv cs.CL TIER_1 English(EN) · Yiming Liang, Yixiao Chen, Yiyang Zhou, Yixuan Wang, Shoubin Yu, Andong Deng, Fuxiao Liu, Qin Zhang, Chen Chen, Mohit Bansal, Huaxiu Yao · 2026-05-26 04:00

STORM: Internalized Modeling for Spatiotemporal Reasoning in Video-Language Models

arXiv:2605.26014v1 Announce Type: cross Abstract: Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning t…

arXiv cs.AI TIER_1 English(EN) · Eunbyung Park · 2026-05-21 14:48

VGenST-Bench: Benchmarking Spatio-Temporal Reasoning via Active Video Synthesis

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static ima…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 00:00

VGenST-Bench: Benchmarking Spatio-Temporal Reasoning via Active Video Synthesis

VGenST-Bench presents a video benchmark using generative models for active synthesis of controlled spatio-temporal reasoning scenarios with human quality control.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Yunpu Ma · 2026-05-16 16:15

PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in …

arXiv cs.CV TIER_1 English(EN) · Felix Tristram, Ege Özsoy, Christian Benz, Marcel Walch, Ghazal Ghazaei, Nassir Navab · 2026-06-12 04:00

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

arXiv:2606.13332v1 Announce Type: new Abstract: Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to model this environment is scene gra…

arXiv cs.CV TIER_1 English(EN) · Nassir Navab · 2026-06-11 13:24

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to model this environment is scene graphs as an interpretable representation of OR int…

arXiv cs.CV TIER_1 English(EN) · Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan · 2026-06-11 04:00

CoVR-R: Reasoning-Oriented Composed Video Retrieval

arXiv:2603.20190v2 Announce Type: replace Abstract: Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit …

arXiv cs.CV TIER_1 English(EN) · Yuchen Guan, Xiao Li, Zongyu Guo, Xiaoyi Zhang, Xiulian Peng, Chun Yuan, Yan Lu · 2026-06-11 04:00

From Content to Knowledge: Lightning-Fast Long-Video Understanding with Neural Knowledge Representations

arXiv:2606.11913v1 Announce Type: new Abstract: We propose a new paradigm for long video understanding by treating a long video as a Neural Knowledge Representation (NKR). NKR represents video contents neither as a stream of tokens nor pre-organized databases, but as an individua…

arXiv cs.CV TIER_1 English(EN) · Biao Tang, Xu Chen, Shuxiang Gou, Jingyi Yuan, Yuhan Zhang, Chenqiang Gao · 2026-06-11 04:00

Q-Fold: Query-Aware Focus-Context Spatiotemporal Folding for Long-Video Understanding

arXiv:2606.12125v1 Announce Type: new Abstract: Long-video understanding remains challenging for multimodal large language models, because temporally extended videos often contain thousands of frames and are therefore expensive to process exhaustively. Existing methods usually co…

arXiv cs.CV TIER_1 English(EN) · Min Yang, Zichen Zhang, Qian Dang, Limin Wang · 2026-06-11 04:00

Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

arXiv:2409.18478v2 Announce Type: replace Abstract: With the development of video understanding, there is a proliferation of tasks for clip-level temporal video analysis, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary det…

arXiv cs.CV TIER_1 English(EN) · Chenqiang Gao · 2026-06-10 14:19

Q-Fold: Query-Aware Focus-Context Spatiotemporal Folding for Long-Video Understanding

Long-video understanding remains challenging for multimodal large language models, because temporally extended videos often contain thousands of frames and are therefore expensive to process exhaustively. Existing methods usually construct compact visual inputs from long videos u…

arXiv cs.CV TIER_1 English(EN) · Yan Lu · 2026-06-10 10:43

From Content to Knowledge: Lightning-Fast Long-Video Understanding with Neural Knowledge Representations

We propose a new paradigm for long video understanding by treating a long video as a Neural Knowledge Representation (NKR). NKR represents video contents neither as a stream of tokens nor pre-organized databases, but as an individual small portion of network weights attached to t…

arXiv cs.CV TIER_1 English(EN) · Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, Fei Luo · 2026-06-09 04:00

MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

arXiv:2606.09641v1 Announce Type: new Abstract: The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. …

arXiv cs.CV TIER_1 English(EN) · Fei Luo · 2026-06-08 15:36

MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce MAVIS,…

arXiv cs.CV TIER_1 English(EN) · Haozhe Chi, Yang Jin, Yadong Mu · 2026-06-08 04:00

GOPAgen: Motion-Aware, Efficient Agentic Long-Video Understanding with Structured Memory and Hierarchical Reasoning

arXiv:2606.06532v1 Announce Type: new Abstract: Despite significant progress in agentic long video understanding, existing methods still lack detailed motion comprehension coupled with an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that firs…

arXiv cs.CV TIER_1 English(EN) · Minghsuan Yang · 2026-06-05 16:29

Watch, Remember, Reason: Human-Perspective Video Understanding with MLLMs

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multim…

arXiv cs.CV TIER_1 English(EN) · Changsheng Xu · 2026-06-05 07:29

Don't Pause: Streaming Video-Language Synchronization for Online Video Understanding

Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while gene…

arXiv cs.CV TIER_1 English(EN) · Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan, Haoyu Yang, Yu Qiao, Yi Wang · 2026-06-05 04:00

Imagine Before Predicting: Interleaved Latent Visual Reasoning for Video Event Prediction

arXiv:2606.05769v1 Announce Type: new Abstract: Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine…

arXiv cs.CV TIER_1 English(EN) · Lin Fu, Zheyuan Yang, Yang Wang, Tingyu Song, Arman Cohan, Yilun Zhao · 2026-06-05 04:00

VideoKR: Toward Knowledge- and Reasoning-Intensive Video Understanding

arXiv:2606.05259v1 Announce Type: new Abstract: We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-license…

arXiv cs.CV TIER_1 English(EN) · Shufan Zhang, Ziyue Lin, Bairun Wang, Lei Jin, Xuanding Ding, Xinzhu Ma, Kunlin Yang · 2026-06-05 04:00

VTI-CoT: Visual-Text Interleaved Chain-of-Thought for Video Reasoning

arXiv:2606.05736v1 Announce Type: new Abstract: Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video r…

arXiv cs.CV TIER_1 English(EN) · Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge, Zhongyuan Wang, Chunxia Xiao, Chao Liang · 2026-06-05 04:00

StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Type, Automatically Generated Corpus

arXiv:2606.06338v1 Announce Type: new Abstract: Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex story…

arXiv cs.CV TIER_1 English(EN) · Chao Liang · 2026-06-04 16:12

StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Type, Automatically Generated Video QA Dataset

Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent l…

arXiv cs.CV TIER_1 English(EN) · Yi Wang · 2026-06-04 06:53

Imagine Before Predicting: Interleaved Latent Visual Reasoning for Video Event Prediction

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues …

arXiv cs.CV TIER_1 English(EN) · Kunlin Yang · 2026-06-04 05:55

VTI-CoT: Visual-Text Interleaved Chain-of-Thought for Video Reasoning

Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only inf…

arXiv cs.CV TIER_1 English(EN) · Honggang Zhang · 2026-06-04 04:00

LongSpace: Exploring Long-Horizon Spatiotemporal Memory in Video, from Perception to Recall

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and …

arXiv cs.CV TIER_1 English(EN) · Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna · 2026-06-04 04:00

TrajTok: Learning Trajectory Tokens Enables Better Video Understanding

arXiv:2602.22779v3 Announce Type: replace Abstract: Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promi…

arXiv cs.CV TIER_1 English(EN) · Sihyun Yu, Nanye Ma, Pinzhi Huang, Hyunseok Lee, Shusheng Yang, June Suk Choi, Ellis Brown, Oscar Michel, Boyang Zheng, Jinwoo Shin, Saining Xie · 2026-06-03 04:00

Benchmarking Visual State Tracking in Multimodal Video Understanding

arXiv:2606.03920v1 Announce Type: new Abstract: Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains u…

arXiv cs.CV TIER_1 English(EN) · Saining Xie · 2026-06-02 17:12

Benchmarking Visual State Tracking in Multimodal Video Understanding

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimoda…

arXiv cs.CV TIER_1 English(EN) · Junbo Zou, Ziheng Huang, Shengjie Zhang, Liwen Zhang, Weining Shen · 2026-06-02 04:00

VideoBrain: Learning Adaptive Frame Sampling for Long-Video Understanding

arXiv:2602.04094v2 Announce Type: replace Abstract: Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existi…

arXiv cs.CV TIER_1 English(EN) · Jinming Liu, Jianguo Huang, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Zongyu Guo, Bin Li, Wenjun Zeng, Yan Lu, Xin Jin · 2026-06-02 04:00

An Efficient Streaming Video Understanding Framework with Agentic Control

arXiv:2605.17921v2 Announce Type: replace Abstract: Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trad…

arXiv cs.CV TIER_1 English(EN) · Raghad Albusayes, Munirah Alyahya · 2026-06-02 04:00

Third Place in the CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval

arXiv:2606.01933v1 Announce Type: new Abstract: This paper presents our winning methodology for the CASTLE 2026 Challenge at the CVPR 2026 EgoVis Workshop, where our team secured third place globally. The challenge tasks participants with answering highly complex visual, spatiote…

arXiv cs.CV TIER_1 English(EN) · Yuan Xie, Tianshui Chen, Zheng Ge, Lionel Ni · 2026-06-01 04:00

Video-MTR: Reinforced Multi-Turn Reasoning for Long-Video Understanding

arXiv:2508.20478v2 Announce Type: replace Abstract: Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face iss…

arXiv cs.CV TIER_1 English(EN) · Xianqiang Gao, Qizhi Chen, Delin Qu, Haoming Song, Zhigang Wang, Bin Zhao, Dong Wang, Xuelong Li · 2026-05-27 04:00

Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

arXiv:2605.27318v1 Announce Type: new Abstract: Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range …

arXiv cs.CV TIER_1 English(EN) · Xuelong Li · 2026-05-26 17:26

Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a ge…

arXiv cs.CV TIER_1 English(EN) · Pipei Huang · 2026-05-26 08:16

DynFrame: An Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Enhancement for Complex Video Understanding

Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking-with-video syst…

arXiv cs.CV TIER_1 English(EN) · Huaxiu Yao · 2026-05-25 16:33

STORM: Internalized Modeling for Spatiotemporal Reasoning in Video-Language Models

Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning through textual chain-of-thought (CoT), keyframe se…

arXiv cs.CV TIER_1 English(EN) · Mingfang Zhang, Jingjing Pan, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan, Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong · 2026-05-25 04:00

CaST-Bench: Benchmarking Causal-Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

arXiv:2605.23216v1 Announce Type: new Abstract: Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rare…

arXiv cs.CV TIER_1 English(EN) · Quan Kong · 2026-05-22 04:19

CaST-Bench: Benchmarking Causal-Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence n…

arXiv cs.CV TIER_1 English(EN) · Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park · 2026-05-22 04:00

VGenST-Bench: Benchmarking Spatio-Temporal Reasoning via Active Video Synthesis

arXiv:2605.22570v1 Announce Type: new Abstract: Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning…
