New benchmarks and methods strengthen large language models' reasoning on visual and multimodal tasks

arXiv cs.AI TIER_1 English(EN) · Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas · 2026-06-12 04:00

Foresight: Iterative Reasoning About Clues that Matter for Navigation

arXiv:2606.12550v1 Announce Type: cross Abstract: Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination m…

arXiv cs.AI TIER_1 English(EN) · Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian · 2026-06-12 04:00

PaLMR: Faithful Visual Reasoning via Multimodal Process Alignment

arXiv:2603.06652v2 Announce Type: replace-cross Abstract: Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinat…

arXiv cs.AI TIER_1 English(EN) · Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen · 2026-06-12 04:00

SpatialClaw: Rethinking the Action Interface for Agentic Spatial Reasoning

arXiv:2606.13673v1 Announce Type: cross Abstract: Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmentin…

arXiv cs.AI TIER_1 English(EN) · Changye Li, Meng Lu, Yi Wu, Ligeng Zhu · 2026-06-12 04:00

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

arXiv:2606.12830v1 Announce Type: cross Abstract: While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation sug…

arXiv cs.AI TIER_1 English(EN) · Min-Hung Chen · 2026-06-11 17:59

SpatialClaw: Rethinking the Action Interface for Agentic Spatial Reasoning

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet the…

arXiv cs.AI TIER_1 English(EN) · Chao Lei, Yanbei Jiang, Markus Hiller, Zhijian Zhou, Xunye Tian, Krista A. Ehinger, Nir Lipovetzky · 2026-06-11 04:00

SVoT: State-Aware Visualization-of-Thought for Spatial Reasoning with Reinforcement Learning

arXiv:2606.11770v1 Announce Type: new Abstract: Spatial reasoning remains a challenge for Multimodal Large Language Models (MLLMs), as it requires reliable multi-hop inference over both intermediate states and state transitions. Current studies often leave intermediate states unv…

arXiv cs.AI TIER_1 English(EN) · Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao · 2026-06-11 04:00

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

arXiv:2606.11683v1 Announce Type: cross Abstract: Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambigu…

arXiv cs.AI TIER_1 English(EN) · Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari, Leonidas Guibas · 2026-06-11 04:00

The Art of Interrogation: Consistency Enhances Factuality in Spatial Reasoning

arXiv:2606.11918v1 Announce Type: new Abstract: Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (S…

arXiv cs.AI TIER_1 English(EN) · Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Zhe Ji, Jinshan Lai, Xi Ren, Jianwei Hu, Qiang Ma · 2026-06-11 04:00

Embodied-BenchClaw: An Autonomous Multi-Agent System for Constructing Embodied Spatial Intelligence Benchmarks

arXiv:2606.11909v1 Announce Type: new Abstract: Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor-intensive, hard to reuse, and difficult to maintain. Existing embodied benchmarks are often static and may quickly become saturat…

arXiv cs.AI TIER_1 English(EN) · Lachlan McPheat, Navdeep Kaur, Robert Blackwell, Alessandra Russo, Anthony G. Cohn, Pranava Madhyastha · 2026-06-11 04:00

DecompSR: A Dataset for Decomposed Analysis of Compositional Multi-Hop Spatial Reasoning

arXiv:2511.02627v3 Announce Type: replace Abstract: We introduce DecompSR, decomposed spatial reasoning, a large benchmark dataset (over 5m datapoints) and generation framework designed to analyse compositional spatial reasoning ability. The generation of DecompSR allows users to…

arXiv cs.AI TIER_1 English(EN) · Enhan Zhao, Wei Wu, Yuanrui Zhang, Xueliang Zhao, Di He · 2026-06-11 04:00

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

arXiv:2606.11719v1 Announce Type: cross Abstract: Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardle…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

SpatialClaw: Rethinking the Action Interface for Agentic Spatial Reasoning

SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks.

arXiv cs.AI TIER_1 English(EN) · Leonidas Guibas · 2026-06-10 10:50

The Art of Interrogation: Consistency Enhances Factuality in Spatial Reasoning

Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external…

arXiv cs.AI TIER_1 English(EN) · Qiang Ma · 2026-06-10 10:37

Embodied-BenchClaw: An Autonomous Multi-Agent System for Constructing Embodied Spatial Intelligence Benchmarks

Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor-intensive, hard to reuse, and difficult to maintain. Existing embodied benchmarks are often static and may quickly become saturated as models improve, limiting their ability to …

arXiv cs.AI TIER_1 English(EN) · Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso · 2026-06-10 04:00

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

arXiv:2510.04514v3 Announce Type: replace Abstract: Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, those requiring precise visual interpretation rather than relying on textual shortc…

arXiv cs.AI TIER_1 English(EN) · Chenrui Fan, Yijun Liang, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou · 2026-06-10 04:00

V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

arXiv:2512.11995v2 Announce Type: replace-cross Abstract: While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

A training-free framework for spatial reasoning from egocentric videos that enables revisiting conclusions through synthesized novel-view videos generated from predicted 3D geometry.

arXiv cs.AI TIER_1 English(EN) · Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan… · 2026-06-09 04:00

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

arXiv:2606.09669v1 Announce Type: new Abstract: Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) …

arXiv cs.AI TIER_1 English(EN) · Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang · 2026-06-09 04:00

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

arXiv:2606.08992v1 Announce Type: cross Abstract: Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a pro…

arXiv cs.AI TIER_1 English(EN) · Yinpeng Dong · 2026-06-08 15:51

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to asse…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 00:00

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

A multi-agent framework with shared MLLM policy and role-specific training methods improves visual reasoning by reducing hallucinations and enabling efficient parallel processing.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 00:00

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions.

arXiv cs.AI TIER_1 English(EN) · Tianyi Tang, Zhuoyi Lin, Zeyu Feng, Tianyi Ma, Yew-Soon Ong, Ivor Tsang, Haiyan Yin · 2026-06-06 04:00

Causal Scaffolding for Physical Reasoning: A Benchmark for Causally Informed Physical World Understanding in Vision-Language Models

arXiv:2606.05966v1 Announce Type: cross Abstract: Understanding and reasoning about the physical world is the foundation of intelligent behavior, yet state-of-the-art vision-language models (VLMs) still fail at causal physical reasoning, often producing plausible but incorrect an…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-06 00:00

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Dynamic cross-modal coordination is integrated into reinforcement learning with verifiable rewards to improve visual reasoning in multimodal large language models by measuring attention shifts and aligning token roles during chain-of-thought reasoning.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-05 16:33

Skill-3D: Evolving Scene-Aware Skills for Embodied 3D Spatial Reasoning

Skill-3D framework enables agents to learn scene-aware skills through self-evolving memory and skill libraries, improving tool utilization in 3D spatial reasoning tasks.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 12:13

Learning Visual Spatial Planning from Symbolic States via Modality-Gap-Aware Self-Distillation

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over th…

arXiv cs.AI TIER_1 English(EN) · Sichao Li, Sai Ma, Daniel Kilov, Secil Yanik Guyot, Zhuang Li, Seth Lazar · 2026-06-04 04:00

NoRA: Evaluating Grounded Reasonableness in Visual First-Person Normative Action Reasoning

arXiv:2606.04806v1 Announce Type: cross Abstract: LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or redu…

arXiv cs.AI TIER_1 English(EN) · Charlie Gauthier, Sacha Morin, Liam Paull · 2026-06-04 04:00

PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification

arXiv:2606.04226v1 Announce Type: cross Abstract: Simulation environments are useful for both robot policy learning and planning verification and validation. Traditionally, the process of creating a simulation was onerous. Creating a bespoke simulation environment for each indivi…

arXiv cs.AI TIER_1 English(EN) · Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Bowen Song, Weiqiang Wang, Gang Chen · 2026-06-04 04:00

Selecting Smartly in the Dark: Efficient RLVR Reasoning by Tracking Metacognitive Pivots

arXiv:2606.04503v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has greatly advanced large reasoning models (LRMs), but it requires timely training on a huge fully-annotated dataset. To this end, data-efficient RLVR methods have been widely…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

WorldBench: A Challenging and Visually Diverse Benchmark for Multimodal Reasoning

WorldBench is introduced as a visually diverse reasoning benchmark for evaluating multimodal large language models, revealing significant limitations in current models' visual understanding capabilities.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

Thinking with Imagination: Embodied Visual Spatial Reasoning with a World Simulator

Astra is an agentic spatial reasoning framework that enhances Vision-Language Models with action-conditioned visual imagination by coupling a reinforcement learning-trained policy with a world simulator for generating novel-view observations.

arXiv cs.AI TIER_1 English(EN) · Seth Lazar · 2026-06-03 12:30

NoRA: Evaluating Grounded Reasonableness in Visual First-Person Normative Action Reasoning

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate a…

arXiv cs.AI TIER_1 English(EN) · Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan, Ting Su, Haiying Sun, Jiajun Chai, Xiaohan Wang, Guojun Yin · 2026-06-03 04:00

VistaHop: Benchmarking Multi-Hop Visual Reasoning for Visual DeepSearch

arXiv:2606.03273v1 Announce Type: cross Abstract: Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained cl…

arXiv cs.AI TIER_1 English(EN) · Senjie Jin, Peixin Wang, Boyang Liu, Xiaoran Fan, Shuo Li, Zhiheng Xi, Jiazheng Zhang, Yuhao Zhou, Tao Gui, Qi Zhang, Xuanjing Huang · 2026-06-03 04:00

Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Visually Anchored Token Selection

arXiv:2606.03937v1 Announce Type: new Abstract: While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our c…

arXiv cs.LG TIER_1 English(EN) · Yixian Shen, Zhiheng Yang, Qi Bi, Changshuo Wang, Shuai Wang, Jia-Hong Huang, George Floros, Prayag Tiwari, Anuj Pathania · 2026-06-03 04:00

Spectral Progressive Thought Flow for Lightweight Multimodal Reasoning

arXiv:2606.02842v1 Announce Type: new Abstract: Multimodal spatial reasoning often relies on long chains of intermediate textual and visual thoughts, where accumulating visual tokens and dense cross-modal attention incur substantial computation and memory overhead. To address thi…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 00:00

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Imaginative Perception Tokens (IPT) enhance vision-language models' spatial reasoning by providing intermediate perceptual representations that externalize what the model would perceive from alternative viewpoints, outperforming traditional text-based reasoning methods.

arXiv cs.AI TIER_1 English(EN) · Xuanjing Huang · 2026-06-02 17:26

Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Visually Anchored Token Selection

While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collap…

arXiv cs.CL TIER_1 English(EN) · Guojun Yin · 2026-06-02 07:37

VistaHop: Benchmarking Multi-Hop Visual Reasoning for Visual DeepSearch

Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existin…

arXiv cs.AI TIER_1 (CA) · Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu · 2026-06-02 04:00

MMSkills: Towards Multimodal Skills for General Visual Agents

arXiv:2605.13527v3 Announce Type: replace Abstract: Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, howe…

arXiv cs.AI TIER_1 English(EN) · Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang, Ninghao Liu, Jin Sun · 2026-06-02 04:00

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning Reinforcement Learning

arXiv:2606.01599v1 Announce Type: new Abstract: Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image-question-answer samples bounded by …

arXiv cs.AI TIER_1 English(EN) · Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng, Qiyao Sun, Xuanyu Ji, Qingyong Hu · 2026-06-02 04:00

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

arXiv:2606.00148v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matc…

arXiv cs.AI TIER_1 English(EN) · Garvin Guo, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Shuai Dong · 2026-06-02 04:00

Beyond Visual Memory: A Mechanistic Diagnosis of Latent Visual Reasoning

arXiv:2606.01287v1 Announce Type: cross Abstract: Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. These gains are commonly attributed to the tokens encoding visual evidence; recent analyses, h…

arXiv cs.AI TIER_1 English(EN) · Oleksandr Nikitin · 2026-06-02 04:00

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

arXiv:2606.02010v1 Announce Type: cross Abstract: PlanarBench tests whether LLMs can draw planar graphs as ASCII art given only an edge list -- a spatial reasoning task that resists memorization because edge order, edge orientation, and node labels are all permutable. We evaluate…

arXiv cs.AI TIER_1 English(EN) · Gautam Sreekumar, Vishnu Naresh Boddeti · 2026-06-02 04:00

InPhyRe Discovers: Large Multimodal Models Struggle with Inductive Physical Reasoning

arXiv:2509.12263v3 Announce Type: replace Abstract: Large multimodal models (LMMs) encode physical laws observed during training, such as momentum conservation, as parametric knowledge. It allows LMMs to answer physical reasoning queries, such as the outcome of a potential collis…

arXiv cs.AI TIER_1 English(EN) · Yang Yu, Zhuangzhuang Chen, Lanqing Li, Xiaomeng Li · 2026-06-02 04:00

Enhancing RL-Based Visual Reasoning via Selective Adversarial Entropy Intervention

arXiv:2512.10414v2 Announce Type: replace Abstract: Recently, reinforcement learning (RL) has become a common choice in enhancing the reasoning capabilities of vision-language models (VLMs). Considering existing RL-based finetuning methods, entropy intervention turns out to be an…

arXiv cs.AI TIER_1 English(EN) · Zeyu Wang, Jingye Xu, Xiaogang Li, Peiyao Xiao, Qinhao Kong, Ben Wang, Chengliang Xu, Zichao Chen, Bing Zhao, Hu Wei · 2026-06-02 04:00

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

arXiv:2604.03893v2 Announce Type: replace Abstract: Current multimodal benchmarks for scientific reasoning primarily evaluate local information extraction -- models recognize symbols and values and then perform textual inference. They do not assess whether models can reason over …

arXiv cs.AI TIER_1 English(EN) · Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal · 2026-06-02 04:00

When and How to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

arXiv:2602.08236v2 Announce Type: replace-cross Abstract: Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasonin…

arXiv cs.AI TIER_1 English(EN) · Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Man Zhou, Chengjun Xie, Haoxuan Che, Xuanhua He, Jie Zhang · 2026-06-02 04:00

LookWise: Knowing When and Where to Perform Fine-Grained Visual Reasoning in Multimodal LLMs

arXiv:2603.00171v3 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing intere…

arXiv cs.AI TIER_1 English(EN) · Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas, Chun-Ta Lu, Zhicheng Wang · 2026-06-02 04:00

The Cartesian Shortcut: Re-evaluating Visual Reasoning in Polar Coordinate Space

arXiv:2605.09883v2 Announce Type: replace-cross Abstract: As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vuln…

arXiv cs.CL TIER_1 English(EN) · Jixuan He, Xueting Li, Chieh Hubert Lin, Ming-Hsuan Yang · 2026-06-02 04:00

Reasmory: 3D Reconstruction as Explicit Memory for Enhancing Spatial Reasoning in VLMs

arXiv:2606.00963v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimati…

arXiv cs.CL TIER_1 English(EN) · Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei · 2026-06-02 04:00

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

arXiv:2601.14750v4 Announce Type: replace Abstract: Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational…

arXiv cs.LG TIER_1 English(EN) · Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao · 2026-06-02 04:00

DeepLatent: Thinking with Images via Parallel Latent Visual Reasoning

arXiv:2606.00562v1 Announce Type: cross Abstract: The emerging paradigm of "thinking with images" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply e…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

Eliciting Complex Spatial Reasoning in Multimodal Large Models via Wide-Baseline Matching

Wide-baseline matching presents a challenging spatial reasoning testbed for multimodal large language models, requiring systematic evaluation and training frameworks that current models lack, prompting the introduction of ReasonMatch-Bench and Dynamic Correspondence Reinforcement…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 16:30

Explore Actively Like a Pigeon: Enhancing Spatial Reasoning with Agentic Vision-Language Models

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which is difficult for real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiven…

arXiv cs.CL TIER_1 English(EN) · Oleksandr Nikitin · 2026-06-01 10:04

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

PlanarBench tests whether LLMs can draw planar graphs as ASCII art given only an edge list -- a spatial reasoning task that resists memorization because edge order, edge orientation, and node labels are all permutable. We evaluate 91 models on the 199 simplest non-isomorphic conn…

arXiv cs.CL TIER_1 English(EN) · Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen · 2026-06-01 04:00

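Multi-Turn Multi-Agent Dialogue-Based Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, but Only Marginally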

arXiv:2605.31387v1 Announce Type: new Abstract: Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models (VLMs…

arXiv cs.AI TIER_1 English(EN) · Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui · 2026-06-01 04:00

SpatialAct: Probing the Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

arXiv:2605.31148v1 Announce Type: cross Abstract: Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs)…

arXiv cs.AI TIER_1 English(EN) · Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao, Chengliang Xu, Zeyu Wang, Zichao Chen, Bing Zhao, Hu Wei · 2026-06-01 04:00

BilliardPhys-Bench: A Benchmark for Physical Reasoning and Visual Dynamics in Multimodal LLMs

arXiv:2605.30900v1 Announce Type: new Abstract: Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 02:52

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning Reinforcement Learning

TRON enables scalable and controllable reinforcement learning for visual reasoning through an online environment substrate that generates unlimited diverse training instances with verifiable answers.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 00:00

VLMs Are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Video generation models combined with vision-language models acting as test-time teachers through differentiable rewards achieve superior video reasoning performance.

arXiv cs.CL TIER_1 English(EN) · David Schlangen · 2026-05-29 14:51

Multi-Turn Multi-Agent Dialogue-Based Collaborative Reconstruction Marginally Improves VLM Performance on Spatial Reasoning

Robots operating in diverse environments rely on visual input to interpret objects and spatial layouts. In human-collaborative tasks, they are expected to communicate this understanding through language. Vision-language models (VLMs) support robotic tasks involving visual interpr…

arXiv cs.AI TIER_1 English(EN) · Pan Hui · 2026-05-29 10:59

SpatialAct: Probing the Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-c…

arXiv cs.AI TIER_1 English(EN) · Zhe Qian, Nianbing Su, Zhonghua Wang, Hebei Li, Zhongxing Xu, Yueying Li, Fei Luo, Zhuohan Ouyang, Yanbiao Ma · 2026-05-29 04:00

SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

arXiv:2604.10228v2 Announce Type: replace Abstract: Current multimodal models often suffer from shallow reasoning, leading to errors caused by incomplete or inconsistent thought processes. To address this limitation, we propose Self-Verification and Self-Rectification (SVSR), a u…

arXiv cs.AI TIER_1 English(EN) · Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang, Jue Wang, Ran Sun, Zhuo Yang, Wanli Ouyang, Lei Bai, Tianfan Fu, Lu Chen, Xin Chen, Yuqiang Li · 2026-05-29 04:00

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Spanning 19 Materials Science Subfields

arXiv:2605.29833v1 Announce Type: new Abstract: As multimodal language models play an increasingly important role in scientific research, materials science offers a critical testbed due to its interdisciplinary, multimodal, and application-driven nature. However, existing materia…

arXiv cs.AI TIER_1 English(EN) · Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu · 2026-05-29 04:00

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

arXiv:2605.29568v1 Announce Type: new Abstract: Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. Whi…

arXiv cs.AI TIER_1 English(EN) · Jun Liu, Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang, Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang · 2026-05-29 04:00

When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

arXiv:2603.16673v4 Announce Type: replace-cross Abstract: Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

SpatialAct: Probing the Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Vision-language models demonstrate strong performance on isolated spatial reasoning tasks but fail to maintain coherent spatial understanding and reliable actions during multi-turn interactive feedback in 3D environments.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

iVGR: Internalizing Visually Grounded Reasoning in Multimodal LLMs with Reinforcement Learning

A reinforcement learning framework called iVGR is introduced to transfer visual localization capabilities into textual reasoning, improving fine-grained perception in multimodal language models without requiring explicit visual grounding during inference.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Chun Yuan · 2026-05-28 09:09

AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning

Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR, as simple single-pa…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 08:17

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches fo…

arXiv cs.AI TIER_1 English(EN) · Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun, Yidong Chen, Jiayi Ji, Qian Chen, Rongrong Ji · 2026-05-28 04:00

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

arXiv:2605.28160v1 Announce Type: new Abstract: Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite th…

arXiv cs.AI TIER_1 English(EN) · Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu · 2026-05-28 04:00

Vision-OPD: Learning Fine-Grained Details in Multimodal LLMs via On-Policy Self-Distillation

arXiv:2605.18740v3 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: t…

arXiv cs.AI TIER_1 English(EN) · Jiawei Kong, Hao Fang, Shunxiang Liao, Jinyu Li, Bin Chen, Hao Wu, Shu-Tao Xia, Min Zhang · 2026-05-28 04:00

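Reasoning Matters: Mitigating Hallucinations in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization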

arXiv:2605.27906v1 Announce Type: new Abstract: Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically…

arXiv cs.CL TIER_1 English(EN) · Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee · 2026-05-28 04:00

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

arXiv:2605.28774v1 Announce Type: new Abstract: Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behavior…

arXiv cs.CL TIER_1 English(EN) · Byung-Kwan Lee · 2026-05-27 17:36

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the sel…

arXiv cs.AI TIER_1 English(EN) · Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum · 2026-05-27 04:00

Athena: Enhancing Multimodal Reasoning with Data-Efficient Process Reward Models

arXiv:2506.09532v5 Announce Type: replace-cross Abstract: We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time …

arXiv cs.CL TIER_1 English(EN) · Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan · 2026-05-27 04:00

LaRe: Latent Refocusing for Multimodal Reasoning

arXiv:2511.02360v4 Announce Type: replace-cross Abstract: Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explici…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 00:00

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Agents using vision-language models with extended reasoning face challenges in tool utilization, which are addressed through AXPO, a method that improves performance by optimizing thinking prefixes and tool call resampling.

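量子位 (QbitAI) TIER_1 Chinese(ZH) · 克雷西 · 2026-05-26 10:17

Bringing the DSA Attention Mechanism to Multimodal Models: Kuaishou's 可业 2.0 Opens a New Paradigm for Enhanced Reasoning

Between light and shadow, reading the meaning left unsaid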

arXiv cs.CL TIER_1 English(EN) · Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan · 2026-05-26 04:00

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

arXiv:2505.24876v2 Announce Type: replace-cross Abstract: Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic,…

arXiv cs.AI TIER_1 English(EN) · Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu · 2026-05-25 04:00

Vision-Guided Policy Optimization for Multimodal Reasoning

arXiv:2604.09349v2 Announce Type: replace-cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visua…

arXiv cs.CL TIER_1 English(EN) · Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye · 2026-05-22 04:00

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Enhancing Visual Attention

arXiv:2605.22072v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This trans…

arXiv cs.CL TIER_1 English(EN) · Deheng Ye · 2026-05-21 07:10

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Enhancing Visual Attention

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge:…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 00:00

The Cost of Seeing: Trustworthy Multimodal Reasoning within a Holistic Paradigm

Vision-Language Models often fail to faithfully synthesize multimodal data due to reliance on language priors over visual representation, necessitating new evaluation frameworks that prioritize semantic sufficiency over traditional multimodal gain metrics.

arXiv cs.CV TIER_1 English(EN) · Xu-Jing Ye, Yuan-Gen Wang, Ruping Wang · 2026-06-12 04:00

Language-Guided Abstraction for Visual Reasoning

arXiv:2606.12847v1 Announce Type: new Abstract: The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks.…

arXiv cs.CV TIER_1 English(EN) · Masanari Oi, Koki Maeda, Ryuto Koike, Daisuke Oba, Nakamasa Inoue, Naoaki Okazaki · 2026-06-11 04:00

From Communication to Action: Human-Like Multi-Image Spatial Reasoning in Multimodal Large Language Models

arXiv:2602.08735v3 Announce Type: replace Abstract: While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challeng…

arXiv cs.CV TIER_1 English(EN) · Di He · 2026-06-10 06:49

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This stat…

arXiv cs.CV TIER_1 English(EN) · Jiangchao Yao · 2026-06-10 05:52

Reason, Then Re-Reason: Cross-View Revisiting Improves Spatial Reasoning

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable…

arXiv cs.CV TIER_1 English(EN) · Yiming Zhang, Ruoxuan Cao, Zhihang Zhong · 2026-06-10 04:00

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

arXiv:2606.10401v1 Announce Type: new Abstract: Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-base…

arXiv cs.CV TIER_1 English(EN) · Zhihang Zhong · 2026-06-09 04:20

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs …

arXiv cs.CV TIER_1 English(EN) · Didi Zhu, Changrui Chen, Stefanos Zafeiriou, Jiankang Deng · 2026-06-09 04:00

VisualFLIP: Do Predictions in Multimodal Reasoning Depend on Task-Critical Visual Evidence?

arXiv:2606.07872v1 Announce Type: new Abstract: When a multimodal large language model answers a visual reasoning question correctly, is the prediction actually supported by the task-critical visual evidence? Correct answers can coexist with flawed reasoning, making accuracy alon…

arXiv cs.CV TIER_1 English(EN) · Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Zizhao Tong, Xiaofeng Zhang, Xiaosong Yuan · 2026-06-09 04:00

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

arXiv:2606.09290v1 Announce Type: new Abstract: Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-p…

arXiv cs.CV TIER_1 English(EN) · Lianyu Hu, Xiaoyu Ma, Zeqin Liao, Yang Liu · 2026-06-09 04:00

TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

arXiv:2606.08464v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: they perfo…

arXiv cs.CV TIER_1 English(EN) · Hangui Lin, Yan Shu, Zhengyang Liang, Chi Liu, Xiangrui Liu, Minghao Qin, Teng Long, Zheng Liu, Nicu Sebe · 2026-06-09 04:00

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

arXiv:2606.08035v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning …

arXiv cs.CV TIER_1 English(EN) · Xiaosong Yuan · 2026-06-08 09:57

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared …

arXiv cs.CV TIER_1 English(EN) · Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang · 2026-06-08 04:00

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

arXiv:2606.07436v1 Announce Type: new Abstract: This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradi…

arXiv cs.CV TIER_1 English(EN) · Yida Yin, Harish Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai, Shengbang Tong, Wenhu Chen, Hu Xu, Xingyu Fu, Gabriel Sarch, Aleksandra Korolova, Zhuang Liu · 2026-06-08 04:00

WorldBench: A Challenging and Visually Diverse Benchmark for Multimodal Reasoning

arXiv:2606.06538v1 Announce Type: new Abstract: In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs…

arXiv cs.CV TIER_1 English(EN) · Yi Yang · 2026-06-05 16:33

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic str…

arXiv cs.CV TIER_1 English(EN) · Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza, Shafin Rahman · 2026-06-05 04:00

HyperVis: Continuous Latent Visual Relation Graphs on the Lorentz Hyperboloid for Compositional Reasoning

arXiv:2606.06100v1 Announce Type: new Abstract: Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf …

arXiv cs.CV TIER_1 English(EN) · Maëlic Neau, Salim Baloch, Jakob Suchan, Zoe Falomir, Mehul Bhatt · 2026-06-05 04:00

Visual Commonsense-Driven Knowledge Refinement for Scene Graph Generation

arXiv:2606.06369v1 Announce Type: new Abstract: Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically-…

arXiv cs.CV TIER_1 English(EN) · Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu · 2026-06-05 04:00

Thinking with Imagination: Visual-Spatial Reasoning with Embodied Intelligence and World Simulators

arXiv:2606.06476v1 Announce Type: new Abstract: While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infe…

arXiv cs.CV TIER_1 English(EN) · Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li · 2026-06-05 04:00

Learning Visual Spatial Planning from Symbolic States via Modality-Gap-Aware Self-Distillation

arXiv:2606.06076v1 Announce Type: cross Abstract: While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent st…

arXiv cs.CV TIER_1 English(EN) · Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig · 2026-06-05 04:00

Latent Implicit Visual Reasoning

arXiv:2512.21218v2 Announce Type: replace Abstract: While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning task…

arXiv cs.CV TIER_1 English(EN) · Xihui Liu · 2026-06-04 17:56

Thinking with Imagination: Visual-Spatial Reasoning with Embodied Intelligence and World Simulators

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consis…

arXiv cs.CV TIER_1 English(EN) · Mehul Bhatt · 2026-06-04 16:36

Visual Commonsense-Driven Knowledge Refinement for Scene Graph Generation

Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically-guided knowledge refinement framework that syste…

arXiv cs.CV TIER_1 English(EN) · Shafin Rahman · 2026-06-04 12:40

HyperVis: Continuous Latent Visual Relation Graphs on the Lorentz Hyperboloid for Compositional Reasoning

Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf scene graph generator (SGG), but we show this ba…

arXiv cs.CV TIER_1 English(EN) · Xiu Li · 2026-06-04 12:13

Learning Visual Spatial Planning from Symbolic States via Modality-Gap-Aware Self-Distillation

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over th…

arXiv cs.CV TIER_1 English(EN) · Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen · 2026-06-03 04:00

Eliciting Complex Spatial Reasoning in Multimodal Large Models via Wide-Baseline Matching

arXiv:2606.03577v1 Announce Type: new Abstract: Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language model…

arXiv cs.CV TIER_1 English(EN) · Chunhua Shen · 2026-06-02 12:46

Eliciting Complex Spatial Reasoning in Multimodal Large Models via Wide-Baseline Matching

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. How…

arXiv cs.CV TIER_1 English(EN) · Wei Deng, Xianlin Zhang, Mengshi Qi · 2026-06-02 04:00

Explore Actively Like a Pigeon: Enhancing Spatial Reasoning with Agentic Vision-Language Models

arXiv:2606.02459v1 Announce Type: new Abstract: Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which is difficult for real-world applications. Moreover, reinforcement learning methods r…

arXiv cs.CV TIER_1 English(EN) · Qingyang Liu, Bingjie Gao, Canmiao Fu, Zhipeng Huang, Chen Li, Feng Wang, Shuochen Chang, Shaobo Wang, Yali Wang, Keming Ye, Jiangtong Li, Li Niu · 2026-06-02 04:00

Breaking the Dual Bottleneck: Evolving Unified Multimodal Models into Adaptive Interleaved Visual Reasoners

arXiv:2605.14709v2 Announce Type: replace Abstract: Recent unified models integrate multimodal understanding and generation within a single framework. However, an "understanding-generation gap" persists, where models can capture user intent but often fail to translate this semant…

arXiv cs.CV TIER_1 English(EN) · Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, Jonathan Tremblay · 2026-06-02 04:00

SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive Reinforcement Learning

arXiv:2512.04069v2 Announce Type: replace Abstract: Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide…

arXiv cs.CV TIER_1 English(EN) · Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao, Pengfei Wan, Kun Gai, Jing Liao · 2026-06-02 04:00

VLMs Are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

arXiv:2606.02564v1 Announce Type: new Abstract: The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often s…

arXiv cs.CV TIER_1 English(EN) · Jing Liao · 2026-06-01 17:54

VLMs Are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific r…

arXiv cs.CV TIER_1 English(EN) · Mengshi Qi · 2026-06-01 16:30

Explore Actively Like a Pigeon: Enhancing Spatial Reasoning with Agentic Vision-Language Models

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which is difficult for real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiven…

arXiv cs.CV TIER_1 English(EN) · Hengbo Xu, Shengjie Jin, Yanbiao Ma, Zhiwu Lu · 2026-06-01 04:00

VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning

arXiv:2605.31457v1 Announce Type: new Abstract: With the rapid advancement of large multimodal models (LMMs), inference-time overhead has become a key bottleneck for real-world deployment. Existing methods typically prune visual tokens at prefill, assuming the required visual evi…

arXiv cs.CV TIER_1 English(EN) · Chang-Bin Zhang, Yujie Zhong, Qiang Zhang, Kai Han · 2026-06-01 04:00

iVGR: Internalizing Visually Grounded Reasoning in Multimodal Large Language Models (MLLMs) via Reinforcement Learning

arXiv:2605.31096v1 Announce Type: new Abstract: While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In thi…

arXiv cs.CV TIER_1 English(EN) · Zhiwu Lu · 2026-05-29 15:51

VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning

With the rapid advancement of large multimodal models (LMMs), inference-time overhead has become a key bottleneck for real-world deployment. Existing methods typically prune visual tokens at prefill, assuming the required visual evidence remains static during reasoning. However, …

arXiv cs.CV TIER_1 English(EN) · Kai Han · 2026-05-29 10:07

iVGR: Internalizing Visually Grounded Reasoning in Multimodal Large Language Models (MLLMs) via Reinforcement Learning

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating expli…

arXiv cs.CV TIER_1 English(EN) · Junzhe Zhang, Huixuan Zhang, Guirong Wang, Xingyao Zhang, Pei Liu, Lin Qu, Hu Wei, Xiaojun Wan · 2026-05-29 04:00

DMC-CF: A Dynamic Multimodal Counterfactual Question-Answering Benchmark for Causal Reasoning

arXiv:2605.29339v1 Announce Type: new Abstract: With the rapid advancement of multimodal large language models (MLLMs), models have demonstrated increasingly powerful multimodal capabilities. However, whether MLLMs trained through statistical learning can truly understand the cau…

arXiv cs.CV TIER_1 English(EN) · Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan · 2026-05-29 04:00

Train the Agent, Not the Experts: Learning to Leverage Heterogeneous Experts for Multi-Turn Visual Reasoning

arXiv:2605.29894v1 Announce Type: new Abstract: Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, m…

arXiv cs.CV TIER_1 English(EN) · Yilun Qiu, Jiahe Wang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Chun Yuan · 2026-05-29 04:00

AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning

arXiv:2605.29643v1 Announce Type: new Abstract: Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLL…

arXiv cs.CV TIER_1 English(EN) · Jia Wan · 2026-05-28 13:18

Train the Agent, Not the Experts: Learning to Leverage Heterogeneous Experts for Multi-Turn Visual Reasoning

Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, making it difficult to directly support general-p…

arXiv cs.CV TIER_1 English(EN) · Xuanzhao Dong, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xiaobing Yu, Xin Li, Zhipeng Wang, Shao Tang, Gen Li, Yujian Xiong, Hao Wang, Yanxi Chen, Prayag Tiwari, Yalin Wang · 2026-05-28 04:00

Mags-RL: Giving Multimodal Large Language Models a Magnifying Glass via Agentic Reinforcement Learning for Complex Scene Reasoning

arXiv:2605.27960v1 Announce Type: new Abstract: Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex backgr…

arXiv cs.CV TIER_1 English(EN) · Wei Tang, Yanpeng Sun, Shan Zhang, Weihao Bo, Xiaofan Li, Piotr Koniusz, Wei Li, Na Zhao, Zechao Li · 2026-05-28 04:00

Artemis: Structured Visual Reasoning for Perception Policy Learning

arXiv:2512.01988v2 Announce Type: replace Abstract: Recent reinforcement-learning frameworks for visual perception policy usually incorporate intermediate reasoning chains expressed in natural language. Empirical observations indicate that such purely linguistic intermediate reas…

arXiv cs.CV TIER_1 English(EN) · Tianrun Xu, Yue Sun, Qixun Wang, Jingyi Lu, Yuan Wang, Tianren Zhang, Longteng Guo, Fengyun Rao, Jing Lyu, Feng Chen, Jing Liu · 2026-05-28 04:00

Semantically Enhanced Latent Visual Reasoning

arXiv:2605.19342v2 Announce Type: replace Abstract: Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce laten…

arXiv cs.CV TIER_1 English(EN) · Yongjin Kim, Yoonjin Oh, Yerin Kim, Hyomin Kim, Jeeyoung Yun, Yujung Heo, Minjun Kim, Sungwoong Kim · 2026-05-27 04:00

FiRe: Fine-Grained Multimodal Reasoning for Enhanced Image Generation

arXiv:2604.13491v3 Announce Type: replace Abstract: With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unif…

arXiv cs.CV TIER_1 English(EN) · Karan Goyal · 2026-05-22 04:00

The Cost of Seeing: Trustworthy Multimodal Reasoning within a Holistic Paradigm

arXiv:2604.20665v2 Announce Type: replace Abstract: The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery but rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We a…

r/MachineLearning TIER_1 English(EN) · /u/Alternative_Art2984 · 2026-06-04 03:52

Best Visual Reasoning Models of 2026 (with API) [D]

For example, suppose I have a one-hour video and I provide it to ChatGPT or another AI model. If I ask complex reasoning questions about the video, which models are best suited for long-horizon video understanding and reasoning? Which models can …

Coverage sources [129]
