Alibaba launches Qwen3.7-Plus multimodal agent model

X — Qwen (Alibaba) TIER_1 English(EN) · Alibaba_Qwen · 2026-06-01 17:54

👏👏 Introducing Qwen3.7-Plus — a multimodal agent model that unifies vision and language into one versatile agent foundation.

✅ Multimodal interactive hybrid agent: unified GUI & CLI operation across visual and text tasks ✅ Versatile coding agent & productivity assistant …

Hugging Face Blog TIER_1 English(EN) · 2025-08-07 00:00

Vision Language Model Alignment in TRL ⚡️

Hugging Face Blog TIER_1 English(EN) · 2025-05-12 00:00

Vision Language Models (Better, faster, stronger)

Hugging Face Blog TIER_1 English(EN) · 2025-02-21 00:00

SigLIP 2: A Better Multilingual Vision Language Encoder

Hugging Face Blog TIER_1 English(EN) · 2024-12-05 00:00

Welcome PaliGemma 2 – New vision language models by Google

Hugging Face Blog TIER_1 English(EN) · 2024-11-26 00:00

SmolVLM - small yet mighty Vision Language Model

Hugging Face Blog TIER_1 English(EN) · 2024-07-10 00:00

Preference Optimization for Vision Language Models

Hugging Face Blog TIER_1 English(EN) · 2024-06-24 00:00

Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models

Hugging Face Blog TIER_1 English(EN) · 2024-05-14 00:00

PaliGemma – Google's Cutting-Edge Open Vision Language Model

Hugging Face Blog TIER_1 English(EN) · 2024-04-15 00:00

Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

Hugging Face Blog TIER_1 English(EN) · 2024-04-11 00:00

Vision Language Models Explained

Hugging Face Blog TIER_1 English(EN) · 2023-08-22 00:00

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Hugging Face Blog TIER_1 English(EN) · 2023-06-29 00:00

Accelerating Vision-Language Models: BridgeTower on Habana Gaudi2

Hugging Face Blog TIER_1 English(EN) · 2023-02-03 00:00

A Dive into Vision-Language Models

arXiv cs.AI TIER_1 English(EN) · Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang · 2026-06-12 04:00

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

arXiv:2605.16713v2 Announce Type: replace-cross Abstract: Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reason…

arXiv cs.AI TIER_1 English(EN) · Animesh Tripathy, Aswanth Krishnan · 2026-06-12 04:00

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

arXiv:2606.13156v1 Announce Type: cross Abstract: Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its pr…

arXiv cs.AI TIER_1 English(EN) · Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen · 2026-06-12 04:00

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

arXiv:2606.13578v1 Announce Type: cross Abstract: Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, …

arXiv cs.AI TIER_1 English(EN) · Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi · 2026-06-12 04:00

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

arXiv:2602.04208v2 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS meth…

arXiv cs.AI TIER_1 English(EN) · Huajun Chen · 2026-06-11 17:03

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench …

arXiv cs.AI TIER_1 English(EN) · Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda · 2026-06-11 04:00

Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents

arXiv:2604.13733v2 Announce Type: replace-cross Abstract: Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor…

arXiv cs.AI TIER_1 English(EN) · Ahmadreza Jeddi, Minh Ngoc Le, Amirhossein Kazerouni, Hakki Can Karaimer, Hue Nguyen, Iqbal Mohomed, Michael Brudno, Alex Levinshtein, Konstantinos G. Derpanis, Babak Taati, Radek Grzeszczuk · 2026-06-11 04:00

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

arXiv:2606.11576v1 Announce Type: cross Abstract: Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cos…

arXiv cs.AI TIER_1 English(EN) · Haoping Yu, Yuanxi Li, Jing Ma · 2026-06-11 04:00

From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

arXiv:2606.11745v1 Announce Type: cross Abstract: Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large …

arXiv cs.AI TIER_1 English(EN) · Cheng-Yu Yang, Shao-Yuan Lo, Yu-Lun Liu · 2026-06-11 04:00

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

arXiv:2606.12412v1 Announce Type: cross Abstract: Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow…

arXiv cs.AI TIER_1 English(EN) · Peng Sun, Yi Yang, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li · 2026-06-11 04:00

Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

arXiv:2603.09715v2 Announce Type: replace Abstract: Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the …

arXiv cs.LG TIER_1 English(EN) · Narges Babadi, Hadis Karimipour · 2026-06-11 04:00

Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

arXiv:2605.16651v2 Announce Type: replace-cross Abstract: Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these expl…

arXiv cs.LG TIER_1 English(EN) · Hyun Joe Jeong, Gokul Swamy, Andrea Bajcsy · 2026-06-11 04:00

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

arXiv:2606.12299v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically differ…

arXiv cs.LG TIER_1 English(EN) · Pankhuri Vanjani, Zhuoyue Li, Jakub Suliga, Moritz Reuss, Gianluca Geraci, Xinkai Jiang, Rudolf Lioutikov · 2026-06-11 04:00

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

arXiv:2606.12105v1 Announce Type: cross Abstract: Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at …

arXiv cs.LG TIER_1 English(EN) · Samuel Tetteh, Cody Fleming · 2026-06-11 04:00

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

arXiv:2606.11266v1 Announce Type: new Abstract: The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the epis…

arXiv cs.CL TIER_1 English(EN) · Xuan Dong, Zhe Han, Tianhao Niu, Qingfu Zhu, Wanxiang Che · 2026-06-11 04:00

When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

arXiv:2606.11906v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic mu…

arXiv cs.AI TIER_1 English(EN) · Jia Fu, Yongtao Wu, Yihang Chen, Kunyu Peng, Xiao Zhang, Volkan Cevher, Sepideh Pashami, Anders Holst · 2026-06-11 04:00

Diffusion-based Cumulative Adversarial Purification for Vision Language Models

arXiv:2506.03933v2 Announce Type: replace-cross Abstract: Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to adversarial perturbations poses a significant threat to their reliability in real-world applications…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

LabVLA, a vision-language-action model trained with a two-stage approach combining action token pretraining and flow matching, demonstrates superior performance on laboratory automation tasks through simulated data generation and robot-specific learning.

arXiv cs.AI TIER_1 English(EN) · Yu-Lun Liu · 2026-06-10 17:59

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tok…

arXiv cs.LG TIER_1 English(EN) · Andrea Bajcsy · 2026-06-10 16:34

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 13:59

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and…

arXiv cs.LG TIER_1 English(EN) · Rudolf Lioutikov · 2026-06-10 13:59

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and…

arXiv cs.CL TIER_1 English(EN) · Wanxiang Che · 2026-06-10 10:36

When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic multilingual evaluation of VLA models by translati…

arXiv cs.AI TIER_1 English(EN) · Hyunwoong Kim, Seongeun Lee, Hannah Yun, Junhyun Park, Jonggwon Park · 2026-06-10 04:00

SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

arXiv:2606.09871v1 Announce Type: cross Abstract: Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic c…

arXiv cs.CL TIER_1 English(EN) · Pratham Singla, Shivank Garg, Vihan Singh, Paras Chopra · 2026-06-10 04:00

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

arXiv:2606.10400v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than fro…

arXiv cs.AI TIER_1 English(EN) · Taishan Li, Jiwen Zhang, Siyuan Wang, Xuanjing Huang, Zhongyu Wei · 2026-06-10 04:00

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

arXiv:2606.10862v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where…

arXiv cs.AI TIER_1 English(EN) · Jonathan C. Kao, Jason Chan, Andy Wang · 2026-06-10 04:00

Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

arXiv:2606.10180v1 Announce Type: cross Abstract: We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through generic inputs, such as a keyboard. This method can be used out-of-the-box and does not require…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Vision-language models can improve grounding performance under aggressive token reduction by replacing irreversible visual-token pruning with recoverable routing that allows tokens to re-enter the processing pipeline at later stages.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

World Pilot enhances Vision-Language-Action models by incorporating dynamic scene evolution and trajectory priors from a World-Action Model, achieving superior performance in zero-shot out-of-distribution manipulation tasks.

arXiv cs.AI TIER_1 English(EN) · Zhongyu Wei · 2026-06-09 13:39

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable…

arXiv cs.CL TIER_1 English(EN) · Paras Chopra · 2026-06-09 04:18

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark sco…

arXiv cs.AI TIER_1 English(EN) · Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini, Andrea Zugarini · 2026-06-09 04:00

Multilingual Training and Evaluation Resources for Vision-Language Models

arXiv:2604.18347v2 Announce Type: replace-cross Abstract: Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite their growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and m…

arXiv cs.AI TIER_1 English(EN) · Can Wang, Shengwei Wang, Bolin Zhang, Zhiying Tu, Dianhui Chu · 2026-06-09 04:00

An Effective Router for Vision-Language Model Selection

arXiv:2606.08970v1 Announce Type: new Abstract: Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to select the most appropriate one among numerous VLM candidates. Existing work reveals the performa…

arXiv cs.AI TIER_1 English(EN) · Siyuan Liu, Jinyang Wu · 2026-06-09 04:00

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

arXiv:2606.09131v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a k…

arXiv cs.AI TIER_1 English(EN) · Youting Wang, Yuan Tang, Yitian Qian, Chen Zhao · 2026-06-09 04:00

VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents

arXiv:2606.07595v1 Announce Type: cross Abstract: Vision-language agents increasingly consume screenshots, documents, and user interfaces before writing to memory, sending messages, or invoking external tools. We study a concrete failure mode in this setting: action-boundary prop…

arXiv cs.AI TIER_1 English(EN) · Hannah Gao (Massachusetts Institute of Technology), Dylan Hadfield-Menell (Massachusetts Institute of Technology), Rachel Ma (Massachusetts Institute of Technology) · 2026-06-09 04:00

A Dataset for Dynamic Human Preferences for Vision Language Models

arXiv:2606.07653v1 Announce Type: cross Abstract: Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important that we evaluate how well these models can adapt to real-time preferences for different users. While an increasing number…

arXiv cs.AI TIER_1 English(EN) · Lujun Li, Lama Sleem, Niccolo Gentile, Yangjie Xu, Yewei Song, Wenbo Wu, Radu State · 2026-06-09 04:00

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

arXiv:2606.07861v1 Announce Type: cross Abstract: Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of "How many r are there in Strawberry?" asks: how small a…

arXiv cs.AI TIER_1 English(EN) · Khanh D. Nguyen, Hung T. Ho, Chinh T. Nguyen, Thanh Q. Duong, Linh D. Le, Duy M. H. Nguyen, Vien A. Ngo, An T. Le · 2026-06-09 04:00

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

arXiv:2606.08094v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runti…

arXiv cs.AI TIER_1 English(EN) · Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du · 2026-06-09 04:00

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

arXiv:2606.08653v1 Announce Type: cross Abstract: Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent …

arXiv cs.AI TIER_1 English(EN) · Yi Yu, Xinchuan Qiu · 2026-06-09 04:00

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

arXiv:2606.08881v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on aff…

arXiv cs.AI TIER_1 English(EN) · Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang, Kei Suzuki, Matthew Brand, Toshiaki Koike-Akino · 2026-06-09 04:00

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

arXiv:2606.09630v1 Announce Type: cross Abstract: Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery…

arXiv cs.AI TIER_1 English(EN) · Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan · 2026-06-09 04:00

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

arXiv:2602.21172v3 Announce Type: replace Abstract: Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, an…

arXiv cs.AI TIER_1 English(EN) · Soochang Song, Yongjune Kim · 2026-06-09 04:00

Collaborative Edge-to-Server Inference for Vision-Language Models

arXiv:2512.16349v2 Announce Type: replace-cross Abstract: We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge dev…

arXiv cs.AI TIER_1 English(EN) · Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu · 2026-06-09 04:00

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

arXiv:2601.12263v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these model…

arXiv cs.LG TIER_1 English(EN) · Seongbin Park, Fan Zhang, Baharan Mirzasoleiman, Shahriar Talebi, Nader Sehatbakhsh · 2026-06-09 04:00

Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models

arXiv:2606.09749v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in…

arXiv cs.LG TIER_1 English(EN) · Nader Sehatbakhsh · 2026-06-08 17:11

Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models

Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene. Existing safety filters sidestep this …

arXiv cs.AI TIER_1 English(EN) · Toshiaki Koike-Akino · 2026-06-08 15:29

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy froz…

arXiv cs.CL TIER_1 English(EN) · Jinyang Wu · 2026-06-08 07:28

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens dif…

arXiv cs.LG TIER_1 English(EN) · Kelly Cui, Nikhil Prakash, Shoval Messica, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham · 2026-06-08 04:00

The Dual Mechanisms of Spatial Variable Binding in Vision-Language Models

arXiv:2603.22278v2 Announce Type: replace-cross Abstract: Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to bind objects with their properties and spatial relations. Yet it remains unclear where and how such as…

arXiv cs.AI TIER_1 English(EN) · Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen · 2026-06-08 04:00

MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

arXiv:2606.06853v1 Announce Type: cross Abstract: The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-gr…

arXiv cs.AI TIER_1 English(EN) · Ryan D'Cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun, Josiah Aklilu, James Burgess, Yuhui Zhang, Ryan Nayebi, Paola Avila, Robayo, Jin Ye, Ming Hu, Zhongying Deng, Junjun He, Xin Chen, Yue Yao, Robert Tibshirani, Jeffrey J. Nir… · 2026-06-08 04:00

MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

arXiv:2606.06696v1 Announce Type: cross Abstract: Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires rob…

arXiv cs.AI TIER_1 English(EN) · Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, Alan Perotti · 2026-06-08 04:00

The Geometry of Representational Failures in Vision Language Models

arXiv:2602.07025v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirro…

arXiv cs.LG TIER_1 English(EN) · Runyu Zhou, Qi Zhang, Qixun Wang, Yisen Wang · 2026-06-08 04:00

Diagnosing Visual Ignorance in Vision-Language Models

arXiv:2606.06890v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark…

arXiv cs.AI TIER_1 English(EN) · Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele · 2026-06-08 04:00

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

arXiv:2606.07451v1 Announce Type: cross Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent…

arXiv cs.AI TIER_1 English(EN) · Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha · 2026-06-08 04:00

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

arXiv:2606.07172v1 Announce Type: cross Abstract: Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations a…

arXiv cs.AI TIER_1 English(EN) · Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie · 2026-06-08 04:00

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

arXiv:2606.07244v1 Announce Type: cross Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a way…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 00:00

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Research reveals that vision and text tokens in multimodal models evolve asynchronously, leading to inefficient computation; a new asymmetric routing framework reduces visual processing overhead while maintaining performance.

arXiv cs.AI TIER_1 English(EN) · Boyang Zhang, Lianlei Shan · 2026-06-06 04:00

MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

arXiv:2606.06245v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth…

arXiv cs.AI TIER_1 English(EN) · Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding · 2026-06-06 04:00

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

arXiv:2606.06491v1 Announce Type: cross Abstract: Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixe…

arXiv cs.CL TIER_1 English(EN) · Bernt Schiele · 2026-06-05 16:54

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an …

arXiv cs.CL TIER_1 English(EN) · Meeyoung Cha · 2026-06-05 11:40

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only archi…

arXiv cs.CL TIER_1 English(EN) · Minxin Chen, He Zhu, Junyou Su, Wen Wang, Yijie Deng, Wenjia Zhang · 2026-06-05 04:00

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

arXiv:2606.05744v1 Announce Type: new Abstract: Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their int…

arXiv cs.LG TIER_1 English(EN) · Sangwu Park, Wonjoong Kim, Yeonjun In, Sein Kim, Hongseok Kang, Chanyoung Park · 2026-06-05 04:00

Test-Time Training for Visual Foresight Vision-Language-Action Models

arXiv:2605.08215v2 Announce Type: replace-cross Abstract: Visual Foresight VLA (VF-VLA) has become a prominent architectural choice in the recent VLA due to its impressive performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribu…

arXiv cs.LG TIER_1 English(EN) · Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li · 2026-06-05 04:00

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

arXiv:2606.05758v1 Announce Type: cross Abstract: Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorl…

arXiv cs.LG TIER_1 English(EN) · Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu · 2026-06-05 04:00

Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

arXiv:2606.05737v1 Announce Type: cross Abstract: Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy …

arXiv cs.CL TIER_1 English(EN) · Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang · 2026-06-05 04:00

Learning Self-Correction in Vision-Language Models via Rollout Augmentation

arXiv:2602.08503v2 Announce Type: replace-cross Abstract: Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerg…

arXiv cs.CL TIER_1 English(EN) · Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari · 2026-06-05 04:00

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

arXiv:2606.05531v1 Announce Type: cross Abstract: Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-05 00:00

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

TBD-VLA is a discrete vision-language-action framework that combines block diffusion with autoregressive generation to achieve efficient temporal action modeling and faster inference.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 17:59

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior effort…

arXiv cs.AI TIER_1 English(EN) · Mingyu Ding · 2026-06-04 17:59

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior effort…

arXiv cs.AI TIER_1 English(EN) · Lianlei Shan · 2026-06-04 14:48

MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect tex…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 06:26

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the trai…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 05:58

Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 04:49

Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to eval…

arXiv cs.LG TIER_1 English(EN) · Youqi Wu, Mohammad Jalali, Farzan Farnia · 2026-06-04 04:00

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not e…

arXiv cs.AI TIER_1 English(EN) · Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou · 2026-06-04 04:00

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

arXiv:2606.04046v1 Announce Type: cross Abstract: In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-ter…

arXiv cs.AI TIER_1 English(EN) · Tran Dinh Tien, Zhiqiang Shen · 2026-06-04 04:00

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

arXiv:2606.04922v1 Announce Type: cross Abstract: Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typicall…

arXiv cs.AI TIER_1 English(EN) · Elouan Gardès, Seung Eun Yi, Kartik Ahuja, Théo Moutakanni, Huy V. Vo, Piotr Bojanowski, Wolfgang M. Pernice, Loïc Landrieu, Camille Couprie · 2026-06-04 04:00

Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

arXiv:2606.05107v1 Announce Type: cross Abstract: We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific…

arXiv cs.AI TIER_1 English(EN) · Enming Zhang, Jiayang Li, Yanlong Wang, Yanru Wu, Zhenyu Liu, Yang Li · 2026-06-04 04:00

EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation

arXiv:2603.09493v2 Announce Type: replace-cross Abstract: The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they ofte…

arXiv cs.CL TIER_1 English(EN) · Manan Suri, Sarvesh Baskar, Dinesh Manocha · 2026-06-04 04:00

Video2LoRA: Parametric Video Internalization for Vision-Language Models

arXiv:2606.04351v1 Announce Type: cross Abstract: Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video internali…

arXiv cs.CL TIER_1 English(EN) · Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell · 2026-06-04 04:00

Stateful Visual Encoders for Vision-Language Models

arXiv:2606.04433v1 Announce Type: cross Abstract: Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language mo…

arXiv cs.CL TIER_1 English(EN) · Yong Cao, Chuqiao Li, Xianghui Xie, Gerard Pons-Moll, Andreas Geiger · 2026-06-04 04:00

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

arXiv:2606.04773v1 Announce Type: cross Abstract: Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotat…

arXiv cs.CL TIER_1 English(EN) · Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang · 2026-06-04 04:00

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

arXiv:2307.00862v3 Announce Type: replace-cross Abstract: Vision-language tasks, such as VQA, SNLI-VE, and VCR are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods working for v…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather than with…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

BloomBench presents a cognitively grounded bilingual multimodal benchmark for Vision-Language Models, revealing significant cognitive asymmetries and cross-lingual performance gaps in current models.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

AffordanceVLA introduces a unified framework that uses structured affordance forecasting as an intermediate representation to improve the precision of perception-action mapping in robotic manipulation by leveraging vision-language models.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

DRIFT is a framework that adapts pretrained vision-language models for continuous decoding tasks by combining coarse prediction with iterative refinement through flow matching, improving performance across perception and planning tasks.

arXiv cs.AI TIER_1 English(EN) · Camille Couprie · 2026-06-03 17:10

Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and …

arXiv cs.LG TIER_1 English(EN) · Zhiqiang Shen · 2026-06-03 14:17

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating a…

arXiv cs.CL TIER_1 English(EN) · Andreas Geiger · 2026-06-03 11:53

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leavi…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 11:53

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leavi…

arXiv cs.AI TIER_1 English(EN) · Anjie Liu, Yan Song, Zhixun Chen, Ziqin Gong, Zhongwei Yu, Jun Wang · 2026-06-03 04:00

ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

arXiv:2606.03054v1 Announce Type: new Abstract: Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call …

arXiv cs.AI TIER_1 English(EN) · Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra · 2026-06-03 04:00

SCOPE: Real-Time Natural Language Camera Agent at the Edge

arXiv:2606.02951v1 Announce Type: cross Abstract: Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and …

arXiv cs.AI TIER_1 English(EN) · Ying Tang, Dong Li, Youjia Zhang, Zikai Song, Junqing Yu, Wei Yang · 2026-06-03 04:00

PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

arXiv:2606.03444v1 Announce Type: cross Abstract: Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these featur…

arXiv cs.AI TIER_1 English(EN) · Ziyang Chen, Shaoguang Wang, Weiyu Guo, Qianyi Cai, He Zhang, Pengteng Li, Yiren Zhao, Yandong Guo · 2026-06-03 04:00

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

arXiv:2606.03598v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process …

arXiv cs.AI TIER_1 English(EN) · Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen · 2026-06-03 04:00

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement

arXiv:2412.01282v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, the great need for capable artificial intelligence on mobile devices also arises, such as the AI assist…

arXiv cs.AI TIER_1 English(EN) · Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu, Kun Kuang · 2026-06-03 04:00

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

arXiv:2605.18160v2 Announce Type: replace-cross Abstract: In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm…

arXiv cs.CL TIER_1 English(EN) · Youssef Mohamed, Kenneth Ward Church, Mohamed Elhoseiny · 2026-06-03 04:00

Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

arXiv:2606.03345v1 Announce Type: cross Abstract: We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 00:00

Stateful Visual Encoders for Vision-Language Models

Stateful visual encoders condition visual representations on prior features, improving visual comparison tasks in vision-language models.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 00:00

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Video2LoRA enables efficient video processing in vision-language models by predicting Low-Rank Adaptation weights from video representations, reducing computational costs while maintaining video-faithful outputs.

arXiv cs.AI TIER_1 English(EN) · Yandong Guo · 2026-06-02 13:04

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forge…

arXiv cs.CL TIER_1 English(EN) · Mohamed Elhoseiny · 2026-06-02 08:54

Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is d…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 08:54

Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is d…

arXiv cs.LG TIER_1 English(EN) · Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, Roberto Martin-Martin · 2026-06-02 04:00

Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

arXiv:2603.11653v2 Announce Type: replace Abstract: Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in open-ended, evolving environments. However, conventional wisdom from …

arXiv cs.LG TIER_1 English(EN) · Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz · 2026-06-02 04:00

Can Vision Language Models Learn Intuitive Physics from Interaction?

arXiv:2602.06033v2 Announce Type: replace Abstract: Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not…

arXiv cs.AI TIER_1 English(EN) · Abhijith Babu, Ramneet Kaur, Nathaniel D. Bastian, Olivera Kotevska, Susmit Jha, Yanzhao Wu, Sumit Kumar Jha, Anirban Roy · 2026-06-02 04:00

Closed-Loop Neural Activation Control in Vision-Language-Action Models

arXiv:2606.00269v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models can be steered at test time by intervening on semantically meaningful internal directions, but existing methods use a fixed steering coefficient, effectively operating in open loop. This is poorly…

arXiv cs.LG TIER_1 English(EN) · Bing-Cheng Chuang, I-Hsuan Chu, Bor-Jiun Lin, YuanFu Yang, Min Sun, Chun-Yi Lee · 2026-06-02 04:00

The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

arXiv:2606.01847v1 Announce Type: cross Abstract: Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the Euclidean Fallacy: representing SE(3) poses as flat $\mathbb{R}^{1…

arXiv cs.AI TIER_1 English(EN) · Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen, Zhiying Du, Zhaowei Zhang, Yu Deng, Li Zhao, Hao Zhao, Zongqing Lu, Oier Mees, Marc Pollefeys, Jiaolong Yang, Baining Guo · 2026-06-02 04:00

From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

arXiv:2606.00054v1 Announce Type: cross Abstract: Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are co…

arXiv cs.AI TIER_1 English(EN) · Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen · 2026-06-02 04:00

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

arXiv:2601.03309v2 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper r…

arXiv cs.AI TIER_1 English(EN) · Sangin Lee, Yukyung Choi · 2026-06-02 04:00

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

arXiv:2605.13178v2 Announce Type: replace-cross Abstract: In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less infor…

arXiv cs.AI TIER_1 English(EN) · Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari · 2026-06-02 04:00

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

arXiv:2512.05277v3 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliab…

arXiv cs.AI TIER_1 English(EN) · Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee · 2026-06-02 04:00

Understanding the Effects of Distractors on Reasoning Vision-Language Models

arXiv:2511.21397v2 Announce Type: replace-cross Abstract: How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only language models has shown that textual distractors can intensify inverse scaling, causi…

arXiv cs.AI TIER_1 English(EN) · Zhisheng Chen, Tingyu Wu, Zijie Zhou, Zhengwei Xie, Jinhan Li, Ziyan Weng, Liang Lin, Jingwei Song, Zikai Xiao, Yingwei Zhang · 2026-06-02 04:00

PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

arXiv:2602.00415v2 Announce Type: replace Abstract: Memory is not merely a storage mechanism for intelligent systems, but a structure for organizing evidence and constraining belief. This is especially important for multimodal reasoning, where retrieved evidence must be both quer…

arXiv cs.CL TIER_1 English(EN) · Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom, Ji Woo Hong, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo · 2026-06-02 04:00

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

arXiv:2606.00564v1 Announce Type: cross Abstract: While on-policy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision…

arXiv cs.LG TIER_1 English(EN) · Haiyu Wang, Yutong Wang, Leshu Li, Yihui Ren, Sai Qian Zhang · 2026-06-02 04:00

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

arXiv:2606.00573v1 Announce Type: new Abstract: Vision-language models (VLMs) deliver strong multimodal reasoning capabilities, but their large computational cost and high parameter counts make deployment challenging on resource-constrained devices. Low-rank decomposition has eme…

arXiv cs.AI TIER_1 English(EN) · Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn · 2026-06-02 04:00

Cross-modal linkage risk in clinical vision-language models

arXiv:2606.02276v1 Announce Type: cross Abstract: Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radio…

arXiv cs.AI TIER_1 English(EN) · Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv · 2026-06-02 04:00

On the Limits of Token Reduction for Efficient Unified Vision Language Training

arXiv:2606.01503v1 Announce Type: cross Abstract: Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency pe…

arXiv cs.AI TIER_1 English(EN) · Sayeed Shafayet Chowdhury, Md. Shaown Miah · 2026-06-02 04:00

Detect Before You Leap: Mirage Detection in Vision-Language Models

arXiv:2606.00435v1 Announce Type: cross Abstract: Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, known as mirage (Asadi et al. 2026), is especially conce…

arXiv cs.AI TIER_1 English(EN) · Haofan Cao, Zhaoyang Li, Zhichao You, Liang Guo, Tianrui Li · 2026-06-02 04:00

PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation

arXiv:2606.00515v1 Announce Type: cross Abstract: Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-ra…

arXiv cs.AI TIER_1 English(EN) · Rashid Mushkani · 2026-06-02 04:00

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

arXiv:2606.00871v1 Announce Type: cross Abstract: Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes wit…

arXiv cs.LG TIER_1 English(EN) · Pau Montagut Bofi, Mario García Blasco, Tessa Pulli, Markus Vincze · 2026-06-02 04:00

Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation

arXiv:2606.00253v1 Announce Type: cross Abstract: Fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the r…

arXiv cs.AI TIER_1 English(EN) · Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao · 2026-06-02 04:00

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

arXiv:2606.00275v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved compu…

arXiv cs.AI TIER_1 English(EN) · Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota · 2026-06-02 04:00

Continuous Reasoning for Vision-Language-Action

arXiv:2606.00229v1 Announce Type: cross Abstract: Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-l…

arXiv cs.AI TIER_1 English(EN) · Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He · 2026-06-02 04:00

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

arXiv:2606.00095v1 Announce Type: cross Abstract: Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometri…

arXiv cs.LG TIER_1 English(EN) · Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo · 2026-06-02 04:00

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

arXiv:2508.20072v4 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order …

arXiv cs.LG TIER_1 English(EN) · Yitong Jiang, Hongjun Wang, Collin McCarthy, Hanrong Ye, David Wehr, Xinhao Li, Qi Dou, Tianfan Xue, Ka Chun Cheung, Simon See, Wonmin Byeon, Ke Chen, Kai Han, Jinwei Gu, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu · 2026-06-02 04:00

Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

arXiv:2606.00746v1 Announce Type: cross Abstract: Vision foundation models are bottlenecked by the quadratic cost of self-attention, which limits usable resolution and increases the cost of large-scale pretraining. Subquadratic alternatives such as linear attention and state-spac…

arXiv cs.LG TIER_1 English(EN) · Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette · 2026-06-02 04:00

Domain Adaptation with a Single Vision-Language Embedding

arXiv:2410.21361v2 Announce Type: replace-cross Abstract: Domain adaptation has been extensively investigated in computer vision but still requires access to target data at the training time, which might be difficult to obtain in real-world autonomous driving scenarios, especiall…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

MAOAM: Unified Object and Material Selection with Vision-Language Models

A unified vision-language model framework enables precise object and material selection through text or click interactions, supporting diverse editing workflows with improved robustness.

arXiv cs.AI TIER_1 English(EN) · Daniel Truhn · 2026-06-01 14:01

Cross-modal linkage risk in clinical vision-language models

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 07:59

The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the Euclidean Fallacy: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifo…

arXiv cs.CL TIER_1 English(EN) · Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea · 2026-06-01 04:00

"Înțelegi Românește?" A Recipe for Romanian Vision-Language Models

arXiv:2605.31401v1 Announce Type: new Abstract: Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluat…

arXiv cs.LG TIER_1 English(EN) · Yijie Tong, Yifan Hou, Shaobo Cui, Antoine Bosselut, Mrinmaya Sachan · 2026-06-01 04:00

Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

arXiv:2605.30713v1 Announce Type: new Abstract: Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present…

arXiv cs.AI TIER_1 English(EN) · Yuriel Ryan, Hei Man Ip, Adriel Kuek, Paul Pu Liang, Roy Ka-Wei Lee · 2026-06-01 04:00

Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

arXiv:2605.08145v2 Announce Type: replace-cross Abstract: Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities t…

arXiv cs.AI TIER_1 English(EN) · Taiyi Su, Jian Zhu, Tianjian Wang, Youzhang He, Zitai Huang, Jianjun Zhang, Chong Ma, Hanyang Wang, Tianjiao Zhang, Munan Yin, Weihao Ding, Yi Xu · 2026-06-01 04:00

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

arXiv:2605.31286v1 Announce Type: cross Abstract: Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a r…

arXiv cs.AI TIER_1 English(EN) · Jun Wang, Xiaohao Xu, Xiaonan Huang · 2026-06-01 04:00

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

arXiv:2605.31196v1 Announce Type: cross Abstract: Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability…

arXiv cs.AI TIER_1 English(EN) · Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu, Vikas Chandra, Yangyang Shi · 2026-06-01 04:00

VLM3: Vision Language Models Are Native 3D Learners

arXiv:2605.30561v1 Announce Type: cross Abstract: Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision…

arXiv cs.CL TIER_1 Română(RO) · Traian Rebedea · 2026-05-29 15:04

Do you understand Romanian? A Recipe for Romanian Vision-Language Models

Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist. We present a systematic study of bui…

arXiv cs.AI TIER_1 English(EN) · Yi Xu · 2026-05-29 13:20

DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handl…

arXiv cs.AI TIER_1 English(EN) · Xiaonan Huang · 2026-05-29 12:04

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations …

arXiv cs.AI TIER_1 English(EN) · Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian · 2026-05-29 04:00

SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems

arXiv:2603.23853v3 Announce Type: replace Abstract: Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Se…

arXiv cs.AI TIER_1 English(EN) · Jianxin Lin, Chunzheng Zhu, Peter J. Kneuertz, Yunfei Bai, Yuan Xue · 2026-05-29 04:00

When Models Learn to Ask Why: Adaptive Causal Reasoning for Trustworthy Medical Vision-Language Models

arXiv:2603.23085v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and…

arXiv cs.LG TIER_1 English(EN) · Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin · 2026-05-29 04:00

Contrastive Representation Regularization for Vision-Language-Action Models

arXiv:2510.01711v3 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain sub…

arXiv cs.AI TIER_1 English(EN) · Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang · 2026-05-29 04:00

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

arXiv:2605.29562v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scen…

arXiv cs.LG TIER_1 English(EN) · Mohammadreza Teymoorianfard, Jean-Philippe Monteuuis, Jonathan Petit, Amir Houmansadr · 2026-05-29 04:00

ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving

arXiv:2605.29114v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models with integrated reasoning have been proposed for end-to-end autonomous driving, assuming a tight coupling between reasoning and trajectory generation. However, the robustness of such systems und…

arXiv cs.LG TIER_1 English(EN) · Yilin Feng, Ahmed Burak Gulhan, Mahmut Taylan Kandemir · 2026-05-29 04:00

AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference

arXiv:2605.29535v1 Announce Type: new Abstract: Vision-Language Models (VLMs) process thousands of visual tokens per image alongside comparatively few text tokens, yet existing compression methods treat both modalities uniformly. We observe that the two modalities have fundamenta…

arXiv cs.AI TIER_1 English(EN) · Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang · 2026-05-29 04:00

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

arXiv:2605.30011v1 Announce Type: cross Abstract: Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfer…

arXiv cs.CL TIER_1 English(EN) · Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang · 2026-05-29 04:00

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

arXiv:2411.14279v2 Announce Type: replace-cross Abstract: Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to d…

arXiv cs.CL TIER_1 English(EN) · Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang · 2026-05-29 04:00

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

arXiv:2605.30265v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question…

arXiv cs.CL TIER_1 English(EN) · Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello · 2026-05-29 04:00

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

arXiv:2605.30256v1 Announce Type: cross Abstract: Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent…

arXiv cs.AI TIER_1 English(EN) · Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo, Feng Chen, Chi Zhang · 2026-05-29 04:00

Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

arXiv:2605.29462v1 Announce Type: cross Abstract: The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enabling unified inference across both visual and textual modalities and supporting a broader rang…

arXiv cs.AI TIER_1 English(EN) · Yizhuo Lu, Changde Du, Qingyu Shi, Hang Chen, Jie Peng, Liuyun Jiang, Shuangchen Zhao, Huiguang He · 2026-05-29 04:00

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

arXiv:2605.29591v1 Announce Type: new Abstract: Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-…

arXiv cs.CL TIER_1 English(EN) · Emmanuelle Bourigault · 2026-05-29 04:00

World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models

arXiv:2605.29585v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the righ…

arXiv cs.CL TIER_1 English(EN) · Xueqing Wu, Yu-Chi Lin, Kai-Wei Chang, Nanyun Peng · 2026-05-29 04:00

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

arXiv:2605.29496v1 Announce Type: new Abstract: Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce…

arXiv cs.AI TIER_1 English(EN) · Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem · 2026-05-29 04:00

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

arXiv:2605.30126v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can ru…

arXiv cs.AI TIER_1 English(EN) · Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wa… · 2026-05-29 04:00

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

arXiv:2605.30280v1 Announce Type: cross Abstract: Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embod…

arXiv cs.AI TIER_1 English(EN) · Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang, Jiayu Hu, Haozhe Shan, Han Dong, Jinpeng Lu, Yinda Chen, Yi Zhang, Yong Dai, Xiaozhu Ju · 2026-05-29 04:00

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

arXiv:2605.30117v1 Announce Type: new Abstract: Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unifie…

arXiv cs.AI TIER_1 English(EN) · Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu · 2026-05-29 04:00

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

arXiv:2507.09574v3 Announce Type: replace-cross Abstract: Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address the…

arXiv cs.LG TIER_1 English(EN) · Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan · 2026-05-29 04:00

Unveiling the Visual Counting Bottleneck in Vision-Language Models

arXiv:2605.30170v1 Announce Type: cross Abstract: While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by decon…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

Semantic Object Correspondence (SOCO) benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions, revealing gaps between language-grounded localization and visual correspondence while demonstrating stro…

arXiv cs.AI TIER_1 English(EN) · Xionghui Chen · 2026-05-28 17:36

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneo…

arXiv cs.CL TIER_1 English(EN) · Jiaqi Wang · 2026-05-28 17:27

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave …

arXiv cs.CL TIER_1 English(EN) · Shalini De Mello · 2026-05-28 17:20

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiov…

arXiv cs.LG TIER_1 English(EN) · Mrinmaya Sachan · 2026-05-28 16:20

Unveiling the Visual Counting Bottleneck in Vision-Language Models

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive sta…

arXiv cs.AI TIER_1 English(EN) · Muhammad Ferjad Naeem · 2026-05-28 15:57

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, exist…

arXiv cs.AI TIER_1 English(EN) · Xiaozhu Ju · 2026-05-28 15:50

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to…

arXiv cs.CL TIER_1 English(EN) · Marcell Fekete, Johannes Bjerva, Tamás Káldi · 2026-05-28 04:00

When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs

arXiv:2605.28346v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap u…

arXiv cs.AI TIER_1 English(EN) · Xiaomin Yu, Wenjie Zhang, Ziyue Qiao, Chengwei Qin, Hui Xiong · 2026-05-28 04:00

Text-Only Data Synthesis for Vision Language Model Training

arXiv:2503.22655v2 Announce Type: replace Abstract: Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question…

arXiv cs.AI TIER_1 English(EN) · Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice · 2026-05-28 04:00

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

arXiv:2605.27750v1 Announce Type: cross Abstract: Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with tr…

arXiv cs.AI TIER_1 English(EN) · Semi Lee, Hyejin Go, Hyesong Choi · 2026-05-28 04:00

AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

arXiv:2605.27465v1 Announce Type: cross Abstract: The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (…

arXiv cs.AI TIER_1 English(EN) · Xucong Wang, Pengkun Wang, Zhe Zhao, Liheng Yu, Shuang Wang, Yang Wang · 2026-05-28 04:00

FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models

arXiv:2605.28347v1 Announce Type: new Abstract: Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralize…

arXiv cs.AI TIER_1 English(EN) · Fengze Yang, Bo Yu, Xuewen Luo, Cathy Liu, Chenxi Liu · 2026-05-28 04:00

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

arXiv:2605.28115v1 Announce Type: new Abstract: Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing …

arXiv cs.CL TIER_1 English(EN) · Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Jingjing Wang, Xiaobing Yu, Haiyu Wu, Shao Tang, Zhipeng Wang, Langechuan Liu, Shan Lin, Oana Dumitrascu, Yalin Wang · 2026-05-28 04:00

OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

arXiv:2605.27916v1 Announce Type: cross Abstract: The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains su…

arXiv cs.CL TIER_1 English(EN) · Chinh Hoang, Mohammad Rashedul Hasan · 2026-05-28 04:00

The Abstraction Gap in Vision-Language Causal Reasoning

arXiv:2605.28779v1 Announce Type: new Abstract: Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properti…

arXiv cs.LG TIER_1 English(EN) · Xinyu Wang, Mingze Li, Sicheng Lyu, Dongxiu Liu, Kaicheng Yang, Ziyu Zhao, Yufei Cui, Xiao-Wen Chang, Peng Lu · 2026-05-28 04:00

Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

arXiv:2605.28803v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. P…

arXiv cs.AI TIER_1 English(EN) · Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen · 2026-05-28 04:00

Object-Centric Vision Token Pruning for Vision Language Models

arXiv:2511.20439v2 Announce Type: replace-cross Abstract: In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM infere…