PulseAugur

Alibaba launches Qwen3.7-Plus multimodal agent model

Alibaba's Qwen team has released Qwen3.7-Plus, a multimodal agent model that unifies vision and language for agentic tasks, including GUI and CLI operation. The coverage also includes a run of Hugging Face Blog posts on vision-language models, among them Google's PaliGemma 2, Microsoft's Florence-2, and Hugging Face's own Idefics2, together with methods for aligning and optimizing such models.

IMPACT Alibaba's Qwen3.7-Plus release advances multimodal agent capabilities, while Hugging Face's featured models and techniques highlight broader progress in vision-language understanding and alignment.

RANK_REASON New multimodal agent model release from a major lab (Alibaba/Qwen).


AI-generated summary · Google Gemini · from 624 sources.

COVERAGE [624]

  1. X — Qwen (Alibaba) TIER_1 English(EN) · Alibaba_Qwen ·

    👏👏 Introducing Qwen3.7-Plus — a multimodal agent model that unifies vision and language into one versatile agent foundation.

    ✅ Multimodal interactive hybrid agent: unified GUI & CLI operation across visual and text tasks ✅ Versatile coding agent & productivity assistant …

  2. Hugging Face Blog TIER_1 English(EN) ·

    Vision Language Model Alignment in TRL ⚡️

  3. Hugging Face Blog TIER_1 English(EN) ·

    Vision Language Models (Better, faster, stronger)

  4. Hugging Face Blog TIER_1 English(EN) ·

    SigLIP 2: A Better Multilingual Vision Language Encoder

  5. Hugging Face Blog TIER_1 English(EN) ·

    Welcome PaliGemma 2 – New vision language models by Google

  6. Hugging Face Blog TIER_1 English(EN) ·

    SmolVLM - small yet mighty Vision Language Model

  7. Hugging Face Blog TIER_1 English(EN) ·

    Preference Optimization for Vision Language Models

  8. Hugging Face Blog TIER_1 English(EN) ·

    Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models

  9. Hugging Face Blog TIER_1 English(EN) ·

    PaliGemma – Google's Cutting-Edge Open Vision Language Model

  10. Hugging Face Blog TIER_1 English(EN) ·

    Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

  11. Hugging Face Blog TIER_1 English(EN) ·

    Vision Language Models Explained

  12. Hugging Face Blog TIER_1 English(EN) ·

    Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

  13. Hugging Face Blog TIER_1 English(EN) ·

    Accelerating Vision-Language Models: BridgeTower on Habana Gaudi2

  14. Hugging Face Blog TIER_1 English(EN) ·

    A Dive into Vision-Language Models

  15. arXiv cs.AI TIER_1 English(EN) · Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang ·

    GeoWorld-VLM: Geometry from World Models for Vision-Language Models

    arXiv:2605.16713v2 Announce Type: replace-cross Abstract: Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reason…

  16. arXiv cs.AI TIER_1 English(EN) · Animesh Tripathy, Aswanth Krishnan ·

    Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

    arXiv:2606.13156v1 Announce Type: cross Abstract: Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its pr…

  17. arXiv cs.AI TIER_1 English(EN) · Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen ·

    LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

    arXiv:2606.13578v1 Announce Type: cross Abstract: Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, …

  18. arXiv cs.AI TIER_1 English(EN) · Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi ·

    SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

    arXiv:2602.04208v2 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS meth…

  19. arXiv cs.AI TIER_1 English(EN) · Huajun Chen ·

    LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

    Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench …

  20. arXiv cs.AI TIER_1 English(EN) · Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda ·

    Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents

    arXiv:2604.13733v2 Announce Type: replace-cross Abstract: Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor…

  21. arXiv cs.AI TIER_1 English(EN) · Ahmadreza Jeddi, Minh Ngoc Le, Amirhossein Kazerouni, Hakki Can Karaimer, Hue Nguyen, Iqbal Mohomed, Michael Brudno, Alex Levinshtein, Konstantinos G. Derpanis, Babak Taati, Radek Grzeszczuk ·

    AVIS: Adaptive Test-Time Scaling for Vision-Language Models

    arXiv:2606.11576v1 Announce Type: cross Abstract: Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cos…

  22. arXiv cs.AI TIER_1 English(EN) · Haoping Yu, Yuanxi Li, Jing Ma ·

    From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

    arXiv:2606.11745v1 Announce Type: cross Abstract: Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large …

  23. arXiv cs.AI TIER_1 English(EN) · Cheng-Yu Yang, Shao-Yuan Lo, Yu-Lun Liu ·

    Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

    arXiv:2606.12412v1 Announce Type: cross Abstract: Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow…

  24. arXiv cs.AI TIER_1 English(EN) · Peng Sun, Yi Yang, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li ·

    Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

    arXiv:2603.09715v2 Announce Type: replace Abstract: Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the …

  25. arXiv cs.LG TIER_1 English(EN) · Narges Babadi, Hadis Karimipour ·

    Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

    arXiv:2605.16651v2 Announce Type: replace-cross Abstract: Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these expl…

  26. arXiv cs.LG TIER_1 English(EN) · Hyun Joe Jeong, Gokul Swamy, Andrea Bajcsy ·

    Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

    arXiv:2606.12299v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically differ…

  27. arXiv cs.LG TIER_1 English(EN) · Pankhuri Vanjani, Zhuoyue Li, Jakub Suliga, Moritz Reuss, Gianluca Geraci, Xinkai Jiang, Rudolf Lioutikov ·

    DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

    arXiv:2606.12105v1 Announce Type: cross Abstract: Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at …

  28. arXiv cs.LG TIER_1 English(EN) · Samuel Tetteh, Cody Fleming ·

    Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

    arXiv:2606.11266v1 Announce Type: new Abstract: The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the epis…

  29. arXiv cs.CL TIER_1 English(EN) · Xuan Dong, Zhe Han, Tianhao Niu, Qingfu Zhu, Wanxiang Che ·

    When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

    arXiv:2606.11906v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic mu…

  30. arXiv cs.AI TIER_1 English(EN) · Jia Fu, Yongtao Wu, Yihang Chen, Kunyu Peng, Xiao Zhang, Volkan Cevher, Sepideh Pashami, Anders Holst ·

    Diffusion-based Cumulative Adversarial Purification for Vision Language Models

    arXiv:2506.03933v2 Announce Type: replace-cross Abstract: Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to adversarial perturbations poses a significant threat to their reliability in real-world applications…

  31. Hugging Face Daily Papers TIER_1 English(EN) ·

    LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

    LabVLA, a vision-language-action model trained with a two-stage approach combining action token pretraining and flow matching, demonstrates superior performance on laboratory automation tasks through simulated data generation and robot-specific learning.

  32. arXiv cs.AI TIER_1 English(EN) · Yu-Lun Liu ·

    Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

    Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tok…

  33. arXiv cs.LG TIER_1 English(EN) · Andrea Bajcsy ·

    Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

    Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be …

  34. Hugging Face Daily Papers TIER_1 English(EN) ·

    DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

    Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and…

  35. arXiv cs.LG TIER_1 English(EN) · Rudolf Lioutikov ·

    DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

    Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and…

  36. arXiv cs.CL TIER_1 English(EN) · Wanxiang Che ·

    When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

    Vision-Language-Action (VLA) models have shown strong performance in language-conditioned robotic manipulation, yet their robustness to linguistic variation remains poorly understood. In this work, we present the first systematic multilingual evaluation of VLA models by translati…

  37. arXiv cs.AI TIER_1 English(EN) · Hyunwoong Kim, Seongeun Lee, Hannah Yun, Junhyun Park, Jonggwon Park ·

    SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

    arXiv:2606.09871v1 Announce Type: cross Abstract: Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic c…

  38. arXiv cs.CL TIER_1 English(EN) · Pratham Singla, Shivank Garg, Vihan Singh, Paras Chopra ·

    Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

    arXiv:2606.10400v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than fro…

  39. arXiv cs.AI TIER_1 English(EN) · Taishan Li, Jiwen Zhang, Siyuan Wang, Xuanjing Huang, Zhongyu Wei ·

    LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

    arXiv:2606.10862v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where…

  40. arXiv cs.AI TIER_1 English(EN) · Jonathan C. Kao, Jason Chan, Andy Wang ·

    Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs

    arXiv:2606.10180v1 Announce Type: cross Abstract: We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through generic inputs, such as a keyboard. This method can be used out-of-the-box and does not require…

  41. Hugging Face Daily Papers TIER_1 English(EN) ·

    Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

    Vision-language models can improve grounding performance under aggressive token reduction by replacing irreversible visual-token pruning with recoverable routing that allows tokens to re-enter the processing pipeline at later stages.

  42. Hugging Face Daily Papers TIER_1 English(EN) ·

    World Pilot: Steering Vision-Language-Action Models with World-Action Priors

    World Pilot enhances Vision-Language-Action models by incorporating dynamic scene evolution and trajectory priors from a World-Action Model, achieving superior performance in zero-shot out-of-distribution manipulation tasks.

  43. arXiv cs.AI TIER_1 English(EN) · Zhongyu Wei ·

    LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

    Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable…

  44. arXiv cs.CL TIER_1 English(EN) · Paras Chopra ·

    Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

    Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark sco…

  45. arXiv cs.AI TIER_1 English(EN) · Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini, Andrea Zugarini ·

    Multilingual Training and Evaluation Resources for Vision-Language Models

    arXiv:2604.18347v2 Announce Type: replace-cross Abstract: Vision Language Models (VLMs) achieved rapid progress in the recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limitations: (i) the lack of multilingual and m…

  46. arXiv cs.AI TIER_1 English(EN) · Can Wang, Shengwei Wang, Bolin Zhang, Zhiying Tu, Dianhui Chu ·

    An Effective Router for Vision-Language Model Selection

    arXiv:2606.08970v1 Announce Type: new Abstract: Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to select the most appropriate one among numerous VLM candidates. Existing work reveals the performa…

  47. arXiv cs.AI TIER_1 English(EN) · Siyuan Liu, Jinyang Wu ·

    Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

    arXiv:2606.09131v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a k…

  48. arXiv cs.AI TIER_1 English(EN) · Youting Wang, Yuan Tang, Yitian Qian, Chen Zhao ·

    VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents

    arXiv:2606.07595v1 Announce Type: cross Abstract: Vision-language agents increasingly consume screenshots, documents, and user interfaces before writing to memory, sending messages, or invoking external tools. We study a concrete failure mode in this setting: action-boundary prop…

  49. arXiv cs.AI TIER_1 English(EN) · Hannah Gao (Massachusetts Institute of Technology), Dylan Hadfield-Menell (Massachusetts Institute of Technology), Rachel Ma (Massachusetts Institute of Technology) ·

    A Dataset for Dynamic Human Preferences for Vision Language Models

    arXiv:2606.07653v1 Announce Type: cross Abstract: Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important that we evaluate how well these models can adapt to real-time preferences for different users. While an increasing number…

  50. arXiv cs.AI TIER_1 English(EN) · Lujun Li, Lama Sleem, Niccolo Gentile, Yangjie Xu, Yewei Song, Wenbo Wu, Radu State ·

    The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

    arXiv:2606.07861v1 Announce Type: cross Abstract: Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of "How many r are there in Strawberry?" asks: how small a…

  51. arXiv cs.AI TIER_1 English(EN) · Khanh D. Nguyen, Hung T. Ho, Chinh T. Nguyen, Thanh Q. Duong, Linh D. Le, Duy M. H. Nguyen, Vien A. Ngo, An T. Le ·

    vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

    arXiv:2606.08094v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runti…

  52. arXiv cs.AI TIER_1 English(EN) · Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du ·

    FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

    arXiv:2606.08653v1 Announce Type: cross Abstract: Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent …

  53. arXiv cs.AI TIER_1 English(EN) · Yi Yu, Xinchuan Qiu ·

    Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

    arXiv:2606.08881v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on aff…

  54. arXiv cs.AI TIER_1 English(EN) · Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang, Kei Suzuki, Matthew Brand, Toshiaki Koike-Akino ·

    ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

    arXiv:2606.09630v1 Announce Type: cross Abstract: Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery…

  55. arXiv cs.AI TIER_1 English(EN) · Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan ·

    NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

    arXiv:2602.21172v3 Announce Type: replace Abstract: Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, an…

  56. arXiv cs.AI TIER_1 English(EN) · Soochang Song, Yongjune Kim ·

    Collaborative Edge-to-Server Inference for Vision-Language Models

    arXiv:2512.16349v2 Announce Type: replace-cross Abstract: We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge dev…

  57. arXiv cs.AI TIER_1 English(EN) · Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu ·

    Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

    arXiv:2601.12263v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these model…

  58. arXiv cs.LG TIER_1 English(EN) · Seongbin Park, Fan Zhang, Baharan Mirzasoleiman, Shahriar Talebi, Nader Sehatbakhsh ·

    Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models

    arXiv:2606.09749v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in…

  59. arXiv cs.LG TIER_1 English(EN) · Nader Sehatbakhsh ·

    Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models

    Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene. Existing safety filters sidestep this …

  60. arXiv cs.AI TIER_1 English(EN) · Toshiaki Koike-Akino ·

    ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

    Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy froz…

  61. arXiv cs.CL TIER_1 English(EN) · Jinyang Wu ·

    Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

    Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens dif…

  62. arXiv cs.LG TIER_1 English(EN) · Kelly Cui, Nikhil Prakash, Shoval Messica, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham ·

    The Dual Mechanisms of Spatial Variable Binding in Vision-Language Models

    arXiv:2603.22278v2 Announce Type: replace-cross Abstract: Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to bind objects with their properties and spatial relations. Yet it remains unclear where and how such as…

  63. arXiv cs.AI TIER_1 English(EN) · Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen ·

    MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

    arXiv:2606.06853v1 Announce Type: cross Abstract: The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-gr…

  64. arXiv cs.AI TIER_1 English(EN) · Ryan D'Cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun, Josiah Aklilu, James Burgess, Yuhui Zhang, Ryan Nayebi, Paola Avila, Robayo, Jin Ye, Ming Hu, Zhongying Deng, Junjun He, Xin Chen, Yue Yao, Robert Tibshirani, Jeffrey J. Nir… ·

    MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

    arXiv:2606.06696v1 Announce Type: cross Abstract: Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires rob…

  65. arXiv cs.AI TIER_1 English(EN) · Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, Alan Perotti ·

    The Geometry of Representational Failures in Vision Language Models

    arXiv:2602.07025v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirro…

  66. arXiv cs.LG TIER_1 English(EN) · Runyu Zhou, Qi Zhang, Qixun Wang, Yisen Wang ·

    Diagnosing Visual Ignorance in Vision-Language Models

    arXiv:2606.06890v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark…

  67. arXiv cs.AI TIER_1 English(EN) · Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele ·

    TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

    arXiv:2606.07451v1 Announce Type: cross Abstract: Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent…

  68. arXiv cs.AI TIER_1 English(EN) · Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha ·

    Textual Supervision Enhances Geospatial Representations in Vision-Language Models

    arXiv:2606.07172v1 Announce Type: cross Abstract: Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations a…

  69. arXiv cs.AI TIER_1 English(EN) · Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie ·

    Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

    arXiv:2606.07244v1 Announce Type: cross Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a way…

  70. Hugging Face Daily Papers TIER_1 English(EN) ·

    Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

    Research reveals that vision and text tokens in multimodal models evolve asynchronously, leading to inefficient computation; a new asymmetric routing framework reduces visual processing overhead while maintaining performance.

  71. arXiv cs.AI TIER_1 English(EN) · Boyang Zhang, Lianlei Shan ·

    MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

    arXiv:2606.06245v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth…

  72. arXiv cs.AI TIER_1 English(EN) · Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding ·

    TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

    arXiv:2606.06491v1 Announce Type: cross Abstract: Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixe…

  73. arXiv cs.CL TIER_1 English(EN) · Bernt Schiele ·

    TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

    Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an …

  74. arXiv cs.CL TIER_1 English(EN) · Meeyoung Cha ·

    Textual Supervision Enhances Geospatial Representations in Vision-Language Models

    Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only archi…

  75. arXiv cs.CL TIER_1 English(EN) · Minxin Chen, He Zhu, Junyou Su, Wen Wang, Yijie Deng, Wenjia Zhang ·

    PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

    arXiv:2606.05744v1 Announce Type: new Abstract: Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their int…

  76. arXiv cs.LG TIER_1 English(EN) · Sangwu Park, Wonjoong Kim, Yeonjun In, Sein Kim, Hongseok Kang, Chanyoung Park ·

    Test-Time Training for Visual Foresight Vision-Language-Action Models

    arXiv:2605.08215v2 Announce Type: replace-cross Abstract: Visual Foresight VLA (VF-VLA) has become a prominent architectural choice in the recent VLA due to its impressive performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribu…

  77. arXiv cs.LG TIER_1 English(EN) · Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li ·

    DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

    arXiv:2606.05758v1 Announce Type: cross Abstract: Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorl…

  78. arXiv cs.LG TIER_1 English(EN) · Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu ·

    Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

    arXiv:2606.05737v1 Announce Type: cross Abstract: Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy …

  79. arXiv cs.CL TIER_1 English(EN) · Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang ·

    Learning Self-Correction in Vision-Language Models via Rollout Augmentation

    arXiv:2602.08503v2 Announce Type: replace-cross Abstract: Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerg…

  80. arXiv cs.CL TIER_1 English(EN) · Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari ·

    Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

    arXiv:2606.05531v1 Announce Type: cross Abstract: Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing…

  81. Hugging Face Daily Papers TIER_1 English(EN) ·

    TBD-VLA: Temporal Block Diffusion Vision Language Action Model

    TBD-VLA is a discrete vision-language-action framework that combines block diffusion with autoregressive generation to achieve efficient temporal action modeling and faster inference.

  82. Hugging Face Daily Papers TIER_1 English(EN) ·

    TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

    Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior effort…

  83. arXiv cs.AI TIER_1 English(EN) · Mingyu Ding ·

    TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

    Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior effort…

  84. arXiv cs.AI TIER_1 English(EN) · Lianlei Shan ·

    MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action

    Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect tex…

  85. Hugging Face Daily Papers TIER_1 English(EN) ·

    Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

    Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the trai…

  86. Hugging Face Daily Papers TIER_1 English(EN) ·

    Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

    Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and…

  87. Hugging Face Daily Papers TIER_1 English(EN) ·

    Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

    Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to eval…

  88. arXiv cs.LG TIER_1 English(EN) · Youqi Wu, Mohammad Jalali, Farzan Farnia ·

    KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

    arXiv:2606.04180v1 Announce Type: new Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not e…

  89. arXiv cs.AI TIER_1 English(EN) · Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou ·

    Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

    arXiv:2606.04046v1 Announce Type: cross Abstract: In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs & VLAs) are powerful tools with different benefits: VLMs are better at long-ter…

  90. arXiv cs.AI TIER_1 English(EN) · Tran Dinh Tien, Zhiqiang Shen ·

    Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

    arXiv:2606.04922v1 Announce Type: cross Abstract: Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typicall…

  91. arXiv cs.AI TIER_1 English(EN) · Elouan Gardès, Seung Eun Yi, Kartik Ahuja, Théo Moutakanni, Huy V. Vo, Piotr Bojanowski, Wolfgang M. Pernice, Loïc Landrieu, Camille Couprie ·

    Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

    arXiv:2606.05107v1 Announce Type: cross Abstract: We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific…

  92. arXiv cs.AI TIER_1 English(EN) · Enming Zhang, Jiayang Li, Yanlong Wang, Yanru Wu, Zhenyu Liu, Yang Li ·

    EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation

    arXiv:2603.09493v2 Announce Type: replace-cross Abstract: The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they ofte…

  93. arXiv cs.CL TIER_1 Italiano(IT) · Manan Suri, Sarvesh Baskar, Dinesh Manocha ·

    Video2LoRA: Parametric Video Internalization for Vision-Language Models

    arXiv:2606.04351v1 Announce Type: cross Abstract: Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video internali…

  94. arXiv cs.CL TIER_1 English(EN) · Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell ·

    Stateful Visual Encoders for Vision-Language Models

    arXiv:2606.04433v1 Announce Type: cross Abstract: Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language mo…

  95. arXiv cs.CL TIER_1 English(EN) · Yong Cao, Chuqiao Li, Xianghui Xie, Gerard Pons-Moll, Andreas Geiger ·

    NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

    arXiv:2606.04773v1 Announce Type: cross Abstract: Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotat…

  96. arXiv cs.CL TIER_1 English(EN) · Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang ·

    UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

    arXiv:2307.00862v3 Announce Type: replace-cross Abstract: Vision-language tasks, such as VQA, SNLI-VE, and VCR are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods working for v…

  97. Hugging Face Daily Papers TIER_1 English(EN) ·

    Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

    Research challenges the conventional wisdom in latent visual reasoning by demonstrating that cosine alignment between supervised latents and visual targets negatively correlates with model accuracy, while revealing that answers are decoded downstream from latents rather than with…

  98. Hugging Face Daily Papers TIER_1 English(EN) ·

    Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

    BloomBench presents a cognitively grounded bilingual multimodal benchmark for Vision-Language Models, revealing significant cognitive asymmetries and cross-lingual performance gaps in current models.

  99. Hugging Face Daily Papers TIER_1 English(EN) ·

    AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

    AffordanceVLA introduces a unified framework that uses structured affordance forecasting as an intermediate representation to improve the precision of perception-action mapping in robotic manipulation by leveraging vision-language models.

  100. Hugging Face Daily Papers TIER_1 English(EN) ·

    DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

    DRIFT is a framework that adapts pretrained vision-language models for continuous decoding tasks by combining coarse prediction with iterative refinement through flow matching, improving performance across perception and planning tasks.

  101. arXiv cs.AI TIER_1 English(EN) · Camille Couprie ·

    Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

    We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and …

  102. arXiv cs.LG TIER_1 English(EN) · Zhiqiang Shen ·

    Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

    Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating a…

  103. arXiv cs.CL TIER_1 English(EN) · Andreas Geiger ·

    NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

    Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leavi…

  104. Hugging Face Daily Papers TIER_1 English(EN) ·

    NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

    Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leavi…

  105. arXiv cs.AI TIER_1 English(EN) · Anjie Liu, Yan Song, Zhixun Chen, Ziqin Gong, Zhongwei Yu, Jun Wang ·

    ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

    arXiv:2606.03054v1 Announce Type: new Abstract: Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call …

  106. arXiv cs.AI TIER_1 English(EN) · Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra ·

    SCOPE: Real-Time Natural Language Camera Agent at the Edge

    arXiv:2606.02951v1 Announce Type: cross Abstract: Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and …

  107. arXiv cs.AI TIER_1 English(EN) · Ying Tang, Dong Li, Youjia Zhang, Zikai Song, Junqing Yu, Wei Yang ·

    PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

    arXiv:2606.03444v1 Announce Type: cross Abstract: Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these featur…

  108. arXiv cs.AI TIER_1 English(EN) · Ziyang Chen, Shaoguang Wang, Weiyu Guo, Qianyi Cai, He Zhang, Pengteng Li, Yiren Zhao, Yandong Guo ·

    PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

    arXiv:2606.03598v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process …

  109. arXiv cs.AI TIER_1 English(EN) · Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen ·

    Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement

    arXiv:2412.01282v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, the great need for capable artificial intelligence on mobile devices also arises, such as the AI assist…

  110. arXiv cs.AI TIER_1 English(EN) · Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu, Kun Kuang ·

    Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

    arXiv:2605.18160v2 Announce Type: replace-cross Abstract: In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm…

  111. arXiv cs.CL TIER_1 English(EN) · Youssef Mohamed, Kenneth Ward Church, Mohamed Elhoseiny ·

    Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

    arXiv:2606.03345v1 Announce Type: cross Abstract: We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset …

  112. Hugging Face Daily Papers TIER_1 English(EN) ·

    Stateful Visual Encoders for Vision-Language Models

    Stateful visual encoders condition visual representations on prior features, improving visual comparison tasks in vision-language models.

  113. Hugging Face Daily Papers TIER_1 Italiano(IT) ·

    Video2LoRA: Parametric Video Internalization for Vision-Language Models

    Video2LoRA enables efficient video processing in vision-language models by predicting Low-Rank Adaptation weights from video representations, reducing computational costs while maintaining video-faithful outputs.

  114. arXiv cs.AI TIER_1 English(EN) · Yandong Guo ·

    PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

    Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forge…

  115. arXiv cs.CL TIER_1 English(EN) · Mohamed Elhoseiny ·

    Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

    We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is d…

  116. Hugging Face Daily Papers TIER_1 English(EN) ·

    Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

    We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is d…

  117. arXiv cs.LG TIER_1 English(EN) · Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, Roberto Martin-Martin ·

    Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

    arXiv:2603.11653v2 Announce Type: replace Abstract: Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in open-ended, evolving environments. However, conventional wisdom from …

  118. arXiv cs.LG TIER_1 English(EN) · Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz ·

    Can Vision Language Models Learn Intuitive Physics from Interaction?

    arXiv:2602.06033v2 Announce Type: replace Abstract: Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not…

  119. arXiv cs.AI TIER_1 English(EN) · Abhijith Babu, Ramneet Kaur, Nathaniel D. Bastian, Olivera Kotevska, Susmit Jha, Yanzhao Wu, Sumit Kumar Jha, Anirban Roy ·

    Closed-Loop Neural Activation Control in Vision-Language-Action Models

    arXiv:2606.00269v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models can be steered at test time by intervening on semantically meaningful internal directions, but existing methods use a fixed steering coefficient, effectively operating in open loop. This is poorly…

  120. arXiv cs.LG TIER_1 English(EN) · Bing-Cheng Chuang, I-Hsuan Chu, Bor-Jiun Lin, YuanFu Yang, Min Sun, Chun-Yi Lee ·

    The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

    arXiv:2606.01847v1 Announce Type: cross Abstract: Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{1…

  121. arXiv cs.AI TIER_1 English(EN) · Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen, Zhiying Du, Zhaowei Zhang, Yu Deng, Li Zhao, Hao Zhao, Zongqing Lu, Oier Mees, Marc Pollefeys, Jiaolong Yang, Baining Guo ·

    From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

    arXiv:2606.00054v1 Announce Type: cross Abstract: Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are co…

  122. arXiv cs.AI TIER_1 English(EN) · Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen ·

    VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

    arXiv:2601.03309v2 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper r…

  123. arXiv cs.AI TIER_1 English(EN) · Sangin Lee, Yukyung Choi ·

    CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

    arXiv:2605.13178v2 Announce Type: replace-cross Abstract: In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less infor…

  124. arXiv cs.AI TIER_1 English(EN) · Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari ·

    From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

    arXiv:2512.05277v3 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliab…

  125. arXiv cs.AI TIER_1 English(EN) · Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee ·

    Understanding the Effects of Distractors on Reasoning Vision-Language Models

    arXiv:2511.21397v2 Announce Type: replace-cross Abstract: How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only language models has shown that textual distractors can intensify inverse scaling, causi…

  126. arXiv cs.AI TIER_1 English(EN) · Zhisheng Chen, Tingyu Wu, Zijie Zhou, Zhengwei Xie, Jinhan Li, Ziyan Weng, Liang Lin, Jingwei Song, Zikai Xiao, Yingwei Zhang ·

    PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

    arXiv:2602.00415v2 Announce Type: replace Abstract: Memory is not merely a storage mechanism for intelligent systems, but a structure for organizing evidence and constraining belief. This is especially important for multimodal reasoning, where retrieved evidence must be both quer…

  127. arXiv cs.CL TIER_1 English(EN) · Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom, Ji Woo Hong, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo ·

    Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

    arXiv:2606.00564v1 Announce Type: cross Abstract: While on-policy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision…

  128. arXiv cs.LG TIER_1 English(EN) · Haiyu Wang, Yutong Wang, Leshu Li, Yihui Ren, Sai Qian Zhang ·

    LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

    arXiv:2606.00573v1 Announce Type: new Abstract: Vision-language models (VLMs) deliver strong multimodal reasoning capabilities, but their large computational cost and high parameter counts make deployment challenging on resource-constrained devices. Low-rank decomposition has eme…

  129. arXiv cs.AI TIER_1 Italiano(IT) · Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn ·

    Cross-modal linkage risk in clinical vision-language models

    arXiv:2606.02276v1 Announce Type: cross Abstract: Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radio…

  130. arXiv cs.AI TIER_1 English(EN) · Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv ·

    On the Limits of Token Reduction for Efficient Unified Vision Language Training

    arXiv:2606.01503v1 Announce Type: cross Abstract: Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency pe…

  131. arXiv cs.AI TIER_1 English(EN) · Sayeed Shafayet Chowdhury, Md. Shaown Miah ·

    Detect Before You Leap: Mirage Detection in Vision-Language Models

    arXiv:2606.00435v1 Announce Type: cross Abstract: Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, known as mirage (Asadi et al. 2026), is especially conce…

  132. arXiv cs.AI TIER_1 English(EN) · Haofan Cao, Zhaoyang Li, Zhichao You, Liang Guo, Tianrui Li ·

    PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation

    arXiv:2606.00515v1 Announce Type: cross Abstract: Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-ra…

  133. arXiv cs.AI TIER_1 English(EN) · Rashid Mushkani ·

    Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

    arXiv:2606.00871v1 Announce Type: cross Abstract: Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes wit…

  134. arXiv cs.LG TIER_1 English(EN) · Pau Montagut Bofi, Mario García Blasco, Tessa Pulli, Markus Vincze ·

    Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation

    arXiv:2606.00253v1 Announce Type: cross Abstract: Fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the r…

  135. arXiv cs.AI TIER_1 English(EN) · Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao ·

    Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

    arXiv:2606.00275v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved compu…

  136. arXiv cs.AI TIER_1 English(EN) · Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota ·

    Continuous Reasoning for Vision-Language-Action

    arXiv:2606.00229v1 Announce Type: cross Abstract: Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-l…

  137. arXiv cs.AI TIER_1 English(EN) · Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He ·

    Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

    arXiv:2606.00095v1 Announce Type: cross Abstract: Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometri…

  138. arXiv cs.LG TIER_1 English(EN) · Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo ·

    Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

    arXiv:2508.20072v4 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order …

  139. arXiv cs.LG TIER_1 English(EN) · Yitong Jiang, Hongjun Wang, Collin McCarthy, Hanrong Ye, David Wehr, Xinhao Li, Qi Dou, Tianfan Xue, Ka Chun Cheung, Simon See, Wonmin Byeon, Ke Chen, Kai Han, Jinwei Gu, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu ·

    Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

    arXiv:2606.00746v1 Announce Type: cross Abstract: Vision foundation models are bottlenecked by the quadratic cost of self-attention, which limits usable resolution and increases the cost of large-scale pretraining. Subquadratic alternatives such as linear attention and state-spac…

  140. arXiv cs.LG TIER_1 English(EN) · Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette ·

    Domain Adaptation with a Single Vision-Language Embedding

    arXiv:2410.21361v2 Announce Type: replace-cross Abstract: Domain adaptation has been extensively investigated in computer vision but still requires access to target data at the training time, which might be difficult to obtain in real-world autonomous driving scenarios, especiall…

  141. Hugging Face Daily Papers TIER_1 English(EN) ·

    MAOAM: Unified Object and Material Selection with Vision-Language Models

    A unified vision-language model framework enables precise object and material selection through text or click interactions, supporting diverse editing workflows with improved robustness.

  142. arXiv cs.AI TIER_1 Italiano(IT) · Daniel Truhn ·

    Cross-modal linkage risk in clinical vision-language models

    Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate …

  143. Hugging Face Daily Papers TIER_1 English(EN) ·

    The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

    Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifo…

  144. arXiv cs.CL TIER_1 English(EN) · Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea ·

    "Înțelegi Românește?" A Recipe for Romanian Vision-Language Models

    arXiv:2605.31401v1 Announce Type: new Abstract: Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluat…

  145. arXiv cs.LG TIER_1 English(EN) · Yijie Tong, Yifan Hou, Shaobo Cui, Antoine Bosselut, Mrinmaya Sachan ·

    Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

    arXiv:2605.30713v1 Announce Type: new Abstract: Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present…

  146. arXiv cs.AI TIER_1 English(EN) · Yuriel Ryan, Hei Man Ip, Adriel Kuek, Paul Pu Liang, Roy Ka-Wei Lee ·

    Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

    arXiv:2605.08145v2 Announce Type: replace-cross Abstract: Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities t…

  147. arXiv cs.AI TIER_1 English(EN) · Taiyi Su, Jian Zhu, Tianjian Wang, Youzhang He, Zitai Huang, Jianjun Zhang, Chong Ma, Hanyang Wang, Tianjiao Zhang, Munan Yin, Weihao Ding, Yi Xu ·

    DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

    arXiv:2605.31286v1 Announce Type: cross Abstract: Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a r…

  148. arXiv cs.AI TIER_1 English(EN) · Jun Wang, Xiaohao Xu, Xiaonan Huang ·

    Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

    arXiv:2605.31196v1 Announce Type: cross Abstract: Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability…

  149. arXiv cs.AI TIER_1 English(EN) · Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu, Vikas Chandra, Yangyang Shi ·

    VLM3: Vision Language Models Are Native 3D Learners

    arXiv:2605.30561v1 Announce Type: cross Abstract: Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision…

  150. arXiv cs.CL TIER_1 Română(RO) · Traian Rebedea ·

    Do you understand Romanian? A Recipe for Romanian Vision-Language Models

    Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist. We present a systematic study of bui…

  151. arXiv cs.AI TIER_1 English(EN) · Yi Xu ·

    DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

    Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handl…

  152. arXiv cs.AI TIER_1 English(EN) · Xiaonan Huang ·

    Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

    Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations …

  153. arXiv cs.AI TIER_1 English(EN) · Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian ·

    SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems

    arXiv:2603.23853v3 Announce Type: replace Abstract: Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Se…

  154. arXiv cs.AI TIER_1 English(EN) · Jianxin Lin, Chunzheng Zhu, Peter J. Kneuertz, Yunfei Bai, Yuan Xue ·

    When Models Learn to Ask Why: Adaptive Causal Reasoning for Trustworthy Medical Vision-Language Models

    arXiv:2603.23085v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and…

  155. arXiv cs.LG TIER_1 English(EN) · Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin ·

    Contrastive Representation Regularization for Vision-Language-Action Models

    arXiv:2510.01711v3 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain sub…

  156. arXiv cs.AI TIER_1 English(EN) · Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang ·

    VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

    arXiv:2605.29562v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scen…

  157. arXiv cs.LG TIER_1 English(EN) · Mohammadreza Teymoorianfard, Jean-Philippe Monteuuis, Jonathan Petit, Amir Houmansadr ·

    ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving

    arXiv:2605.29114v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models with integrated reasoning have been proposed for end-to-end autonomous driving, assuming a tight coupling between reasoning and trajectory generation. However, the robustness of such systems und…

  158. arXiv cs.LG TIER_1 English(EN) · Yilin Feng, Ahmed Burak Gulhan, Mahmut Taylan Kandemir ·

    AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference

    arXiv:2605.29535v1 Announce Type: new Abstract: Vision-Language Models (VLMs) process thousands of visual tokens per image alongside comparatively few text tokens, yet existing compression methods treat both modalities uniformly. We observe that the two modalities have fundamenta…

  159. arXiv cs.AI TIER_1 English(EN) · Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang ·

    VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

    arXiv:2605.30011v1 Announce Type: cross Abstract: Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfer…

  160. arXiv cs.CL TIER_1 English(EN) · Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang ·

    Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

    arXiv:2411.14279v2 Announce Type: replace-cross Abstract: Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to d…

  161. arXiv cs.CL TIER_1 English(EN) · Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang ·

    LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

    arXiv:2605.30265v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question…

  162. arXiv cs.CL TIER_1 English(EN) · Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello ·

    VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

    arXiv:2605.30256v1 Announce Type: cross Abstract: Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent…

  163. arXiv cs.AI TIER_1 English(EN) · Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo, Feng Chen, Chi Zhang ·

    Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

    arXiv:2605.29462v1 Announce Type: cross Abstract: The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enabling unified inference across both visual and textual modalities and supporting a broader rang…

  164. arXiv cs.AI TIER_1 English(EN) · Yizhuo Lu, Changde Du, Qingyu Shi, Hang Chen, Jie Peng, Liuyun Jiang, Shuangchen Zhao, Huiguang He ·

    Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

    arXiv:2605.29591v1 Announce Type: new Abstract: Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-…

  165. arXiv cs.CL TIER_1 English(EN) · Emmanuelle Bourigault ·

    World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models

    arXiv:2605.29585v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the righ…

  166. arXiv cs.CL TIER_1 English(EN) · Xueqing Wu, Yu-Chi Lin, Kai-Wei Chang, Nanyun Peng ·

    On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

    arXiv:2605.29496v1 Announce Type: new Abstract: Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce…

  167. arXiv cs.AI TIER_1 English(EN) · Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem ·

    PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

    arXiv:2605.30126v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can ru…

  168. arXiv cs.AI TIER_1 English(EN) · Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wa… ·

    Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

    arXiv:2605.30280v1 Announce Type: cross Abstract: Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embod…

  169. arXiv cs.AI TIER_1 English(EN) · Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang, Jiayu Hu, Haozhe Shan, Han Dong, Jinpeng Lu, Yinda Chen, Yi Zhang, Yong Dai, Xiaozhu Ju ·

    VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

    arXiv:2605.30117v1 Announce Type: new Abstract: Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unifie…

  170. arXiv cs.AI TIER_1 English(EN) · Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu ·

    MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

    arXiv:2507.09574v3 Announce Type: replace-cross Abstract: Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address the…

  171. arXiv cs.LG TIER_1 English(EN) · Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan ·

    Unveiling the Visual Counting Bottleneck in Vision-Language Models

    arXiv:2605.30170v1 Announce Type: cross Abstract: While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by decon…

  172. Hugging Face Daily Papers TIER_1 English(EN) ·

    SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

    Semantic Object Correspondence (SOCO) benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions, revealing gaps between language-grounded localization and visual correspondence while demonstrating stro…

  173. arXiv cs.AI TIER_1 English(EN) · Xionghui Chen ·

    Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

    Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneo…

  174. arXiv cs.CL TIER_1 English(EN) · Jiaqi Wang ·

    LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

    Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave …

  175. arXiv cs.CL TIER_1 English(EN) · Shalini De Mello ·

    VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

    Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiov…

  176. arXiv cs.LG TIER_1 English(EN) · Mrinmaya Sachan ·

    Unveiling the Visual Counting Bottleneck in Vision-Language Models

    While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive sta…

  177. arXiv cs.AI TIER_1 English(EN) · Muhammad Ferjad Naeem ·

    PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

    Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, exist…

  178. arXiv cs.AI TIER_1 English(EN) · Xiaozhu Ju ·

    VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

    Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to…

  179. arXiv cs.CL TIER_1 English(EN) · Marcell Fekete, Johannes Bjerva, Tamás Káldi ·

    When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs

    arXiv:2605.28346v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap u…

  180. arXiv cs.AI TIER_1 English(EN) · Xiaomin Yu, Wenjie Zhang, Ziyue Qiao, Chengwei Qin, Hui Xiong ·

    Text-Only Data Synthesis for Vision Language Model Training

    arXiv:2503.22655v2 Announce Type: replace Abstract: Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question…

  181. arXiv cs.AI TIER_1 English(EN) · Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice ·

    Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

    arXiv:2605.27750v1 Announce Type: cross Abstract: Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with tr…

  182. arXiv cs.AI TIER_1 English(EN) · Semi Lee, Hyejin Go, Hyesong Choi ·

    AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

    arXiv:2605.27465v1 Announce Type: cross Abstract: The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (…

  183. arXiv cs.AI TIER_1 English(EN) · Xucong Wang, Pengkun Wang, Zhe Zhao, Liheng Yu, Shuang Wang, Yang Wang ·

    FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models

    arXiv:2605.28347v1 Announce Type: new Abstract: Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralize…

  184. arXiv cs.AI TIER_1 English(EN) · Fengze Yang, Bo Yu, Xuewen Luo, Cathy Liu, Chenxi Liu ·

    CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

    arXiv:2605.28115v1 Announce Type: new Abstract: Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing …

  185. arXiv cs.CL TIER_1 English(EN) · Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Jingjing Wang, Xiaobing Yu, Haiyu Wu, Shao Tang, Zhipeng Wang, Langechuan Liu, Shan Lin, Oana Dumitrascu, Yalin Wang ·

    OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

    arXiv:2605.27916v1 Announce Type: cross Abstract: The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains su…

  186. arXiv cs.CL TIER_1 English(EN) · Chinh Hoang, Mohammad Rashedul Hasan ·

    The Abstraction Gap in Vision-Language Causal Reasoning

    arXiv:2605.28779v1 Announce Type: new Abstract: Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properti…

  187. arXiv cs.LG TIER_1 English(EN) · Xinyu Wang, Mingze Li, Sicheng Lyu, Dongxiu Liu, Kaicheng Yang, Ziyu Zhao, Yufei Cui, Xiao-Wen Chang, Peng Lu ·

    Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

    arXiv:2605.28803v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. P…

  188. arXiv cs.AI TIER_1 English(EN) · Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen ·

    Object-Centric Vision Token Pruning for Vision Language Models

    arXiv:2511.20439v2 Announce Type: replace-cross Abstract: In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM infere…

  189. Hugging Face Daily Papers TIER_1 English(EN) ·

    PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

    PARCEL is a vision-language model architecture that dynamically partitions feature extraction tasks to improve efficiency and performance across different visual-token budgets.

  190. Hugging Face Daily Papers TIER_1 English(EN) ·

    Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

    Vision-language models exhibit entangled spatial representations that correlate vertical image position with distance, impacting reasoning robustness and performance across benchmarks.

  191. Hugging Face Daily Papers TIER_1 English(EN) ·

    Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

    A unified vision-language-action model is presented that integrates diverse embodied decision-making tasks through a shared architecture and training approach, demonstrating strong performance across manipulation, navigation, and trajectory prediction with generalization across d…

  192. Hugging Face Daily Papers TIER_1 English(EN) ·

    LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

    Vision-language models suffer from modality sensitivity due to training data bias, but a new data curation approach called Local Modality Substitution improves cross-modal representation alignment and reasoning performance.

  193. Hugging Face Daily Papers TIER_1 English(EN) ·

    VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

    VisualThink-VLA enables fast, accurate vision-language-action policies through visual reasoning that preserves spatial precision and reduces latency compared to text-based approaches.

  194. Hugging Face Daily Papers TIER_1 English(EN) ·

    VLM3: Vision Language Models Are Native 3D Learners

    Vision Language Models can be adapted for 3D understanding tasks through simple architectural modifications and text-based training, achieving performance comparable to specialized vision models without requiring complex designs or extensive data augmentation.

  195. Hugging Face Daily Papers TIER_1 English(EN) ·

    The Abstraction Gap in Vision-Language Causal Reasoning

    Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic qual…

  196. arXiv cs.CL TIER_1 English(EN) · Tamás Káldi ·

    When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs

    Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap using information structure (IS), testing whether…

  197. arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Jon McCormack ·

    Evolving to the Aesthetics of a Vision-Language Model

    Evolutionary systems have demonstrated remarkable results in creative domains, with recent applications in generative typography, design, and music. However, an open problem remains in designing fitness functions that effectively capture the desired aesthetics of abstract outputs…

  198. arXiv cs.AI TIER_1 English(EN) · Takaki Yamamoto, Chihiro Noguchi, Toshihiro Tanizawa ·

    Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

    arXiv:2601.12809v2 Announce Type: replace-cross Abstract: Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text tes…

  199. arXiv cs.CL TIER_1 English(EN) · Yifan Jiang, Dae Yon Hwang, Jesse C. Cresswell, Freda Shi ·

    Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

    arXiv:2605.27311v1 Announce Type: new Abstract: Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to correctly answer, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background…

  200. arXiv cs.AI TIER_1 English(EN) · Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu ·

    LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

    arXiv:2605.27365v1 Announce Type: cross Abstract: Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This tok…

  201. arXiv cs.AI TIER_1 English(EN) · Xintong Hu, Xuhong Huang, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, Tao Yu ·

    FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

    arXiv:2605.27284v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectorie…

  202. arXiv cs.CL TIER_1 English(EN) · Yifan Jiang, Ruoxi Ning, Sheng Yao, Freda Shi ·

    Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

    arXiv:2605.27315v1 Announce Type: new Abstract: Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context …

  203. arXiv cs.AI TIER_1 English(EN) · Xiang Fang, Wanlong Fang, Changshuo Wang ·

    Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

    arXiv:2605.26501v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual question answering by integrating visual and textual inputs. However, their robustness against ad…

  204. arXiv cs.CL TIER_1 English(EN) · Taha Koleilat, Hassan Rivaz, Yiming Xiao ·

    Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

    arXiv:2605.26292v1 Announce Type: cross Abstract: Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous…

  205. arXiv cs.AI TIER_1 English(EN) · Chen Ling, Tongwei Zhang, Hanqian Li, Nai Ding ·

    Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

    arXiv:2601.07737v2 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their ability to process action scenes that contradict everyday common sense remains undertest…

  206. Hugging Face Daily Papers TIER_1 English(EN) ·

    OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

    The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primari…

  207. Hugging Face Daily Papers TIER_1 English(EN) ·

    Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

    Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. However, real-world VLM applications might …

  208. Hugging Face Daily Papers TIER_1 English(EN) ·

    SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation

    Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling al…

  209. Hugging Face Daily Papers TIER_1 English(EN) ·

    From Pixels to Words – Towards Native One-Vision Models at Scale

    NEO-ov is a native vision-language model that end-to-end learns cross-frame and pixel-word correspondences without modular components, enabling unified spatiotemporal modeling and competitive performance in visual perception tasks.

  210. arXiv cs.AI TIER_1 English(EN) · Zhiding Yu ·

    LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

    Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled struct…

  211. Hugging Face Daily Papers TIER_1 English(EN) ·

    LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

    Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled struct…

  212. arXiv cs.CL TIER_1 English(EN) · Freda Shi ·

    Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

    Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context in lexical judgments. We use human concreteness …

  213. arXiv cs.CL TIER_1 English(EN) · Freda Shi ·

    Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

    Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to correctly answer, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasonin…

  214. arXiv cs.AI TIER_1 English(EN) · Tao Yu ·

    FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

    Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving executi…

  215. arXiv cs.AI TIER_1 English(EN) · Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin ·

    FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

    arXiv:2510.10921v3 Announce Type: replace-cross Abstract: Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While mod…

  216. arXiv cs.AI TIER_1 English(EN) · Ulas Berk Karli, Ziyao Shangguan, Tesca Fitzgerald ·

    INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models

    arXiv:2510.01389v2 Announce Type: replace-cross Abstract: Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT,…

  217. arXiv cs.AI TIER_1 English(EN) · Hanzhen Wang, Jiaming Xu, Yushun Xiang, Jiayi Pan, Yongkang Zhou, Yong-Lu Li, Guohao Dai ·

    SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

    arXiv:2509.05614v3 Announce Type: replace-cross Abstract: Pruning is a typical acceleration technique for compute-bound models by removing computation on unimportant values. Recently, it has been applied to accelerate Vision-Language-Action (VLA) model inference. However, existin…

  218. arXiv cs.AI TIER_1 English(EN) · Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang ·

    CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

    arXiv:2506.17629v2 Announce Type: replace-cross Abstract: Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potenti…

  219. arXiv cs.AI TIER_1 English(EN) · Sriram Mandalika ·

    Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning

    arXiv:2505.11758v2 Announce Type: replace-cross Abstract: Few-shot adaptation of vision-language models remains fundamentally limited by how negative class signals are handled at inference. Existing methods apply uniform negative suppression across all queries, ignoring that the …

  220. arXiv cs.AI TIER_1 English(EN) · Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li, Yulun Zhang ·

    DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

    arXiv:2605.26038v1 Announce Type: cross Abstract: Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through mult…

  221. arXiv cs.AI TIER_1 English(EN) · Perry Dong, Kuo-Han Hung, Tian Gao, Dorsa Sadigh, Chelsea Finn ·

    EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

    arXiv:2605.25477v1 Announce Type: cross Abstract: The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained p…

  222. arXiv cs.AI TIER_1 English(EN) · Bruce Changlong Xu, Jose James, Alexander Ryu ·

    From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for Vision-Language Model Weak Supervision Across Three Medical-Imaging Benchmarks

    arXiv:2605.24771v1 Announce Type: cross Abstract: Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler's accuracy, implying a sharp crossover: once a gold-trained classifier matches the labeler, weak labels stop …

  223. arXiv cs.AI TIER_1 English(EN) · Arash Akbari, Arman Akbari, Masih Eskandar, Qitao Tan, Yixiao Chen, Jingwu Luo, Bertha Pangaribuan, Liyun Zhang, Jennifer Dy, Geng Yuan, Xue Lin, Gaowen Liu, Stratis Ioannidis, Yanzhi Wang ·

    ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

    arXiv:2605.24011v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute makes deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural so…

  224. arXiv cs.AI TIER_1 English(EN) · Minghao Fu, Fan Feng, Nicklas Hansen, Biwei Huang ·

    Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

    arXiv:2605.25620v1 Announce Type: new Abstract: World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limite…

  225. arXiv cs.AI TIER_1 English(EN) · Sam Earle, Kay Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi ·

    In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

    arXiv:2605.23908v1 Announce Type: new Abstract: We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes …

  226. arXiv cs.LG TIER_1 English(EN) · Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang ·

    High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

    arXiv:2512.21815v3 Announce Type: replace-cross Abstract: Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, as a measure of model uncertainty, is highly correlated with VLM reliability. While prior entropy-based at…

  227. arXiv cs.LG TIER_1 English(EN) · Lin Zhu, Yifeng Yang, Xinbing Wang, Qinying Gu, Nanyang Ye ·

    ΔEnergy: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization

    arXiv:2510.11296v3 Announce Type: replace-cross Abstract: Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both the in-distribution (ID…

  228. arXiv cs.LG TIER_1 English(EN) · Rui Zhu, Song-Lin Lv, Zi-Kang Wang, Lan-Zhe Guo ·

    Bi-CoG: Bi-Consistency-Guided Self-Training for Vision-Language Models

    arXiv:2510.20477v3 Announce Type: replace Abstract: Exploiting unlabeled data through semi-supervised learning (SSL) or leveraging pre-trained models via fine-tuning are two prevailing paradigms for addressing label-scarce scenarios. Recently, growing attention has been given to …

  229. arXiv cs.LG TIER_1 English(EN) · Jianwei Tai ·

    Capability and Robustness Cannot Both Be Free: An Information-Theoretic Bound for Vision-Language-Action Models

    arXiv:2605.25889v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models are increasingly deployed on real robots, where each predicted action is executed and each failure carries a safety cost. They reach high success rates on clean inputs but collapse under small a…

  230. arXiv cs.CL TIER_1 English(EN) · Azmine Toushik Wasi, Shahriyar Zaman Ridoy, Koushik Ahamed Tonmoy, Kinga Tshering, S. M. Muhtasimul Hasan, Wahid Faisal, Tasnim Mohiuddin, Md Rizwan Parvez ·

    TimeSpot: Benchmarking Geo-Temporal Understanding in Vision-Language Models in Real-World Settings

    arXiv:2603.06687v2 Announce Type: replace-cross Abstract: Geo-temporal understanding, the ability to infer location, time, and contextual properties from visual input alone, underpins applications such as disaster management, traffic planning, embodied navigation, world modeling,…

  231. arXiv cs.CL TIER_1 English(EN) · Shristi Das Biswas, Kaushik Roy ·

    MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

    arXiv:2605.26004v1 Announce Type: cross Abstract: Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage o…

  232. arXiv cs.CL TIER_1 English(EN) · Farhad Nooralahzadeh, Benjamin Gundersen, Nicolas Deperrois, Hidetoshi Matsuom, Mizuho Nishio, Thomas Frauenfelder, Ahmed Allam, Christian Blüthgen, Michael Moor, Michael Krauthammer ·

    Universal Boosts, Specific Suppressors: Sparse Autoencoder Steering of Medical Vision-Language Models

    arXiv:2605.24977v1 Announce Type: cross Abstract: Medical vision-language models (VLMs) often hallucinate findings when generating chest X-ray reports: they fabricate findings that are not present in the image, miss important ones, or locate them incorrectly. We mitigate this wit…

  233. Hugging Face Daily Papers TIER_1 English(EN) ·

    LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

    Parallel Box Decoding enables efficient and accurate unified visual grounding and detection by decoding geometric elements as atomic units, improving both throughput and localization quality.

  234. Hugging Face Daily Papers TIER_1 English(EN) ·

    Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

    Counterfactual charts are introduced to rigorously evaluate visual reasoning in chart question-answering by varying underlying data while keeping tasks fixed, revealing hidden model failures and generalization limitations.

  235. Hugging Face Daily Papers TIER_1 English(EN) ·

    Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

    Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particul…

  236. arXiv cs.AI TIER_1 English(EN) · Yulun Zhang ·

    DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

    Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for …

  237. arXiv cs.LG TIER_1 English(EN) · Jianwei Tai ·

    Capability and Robustness Cannot Both Be Free: An Information-Theoretic Bound for Vision-Language-Action Models

    Vision-Language-Action (VLA) models are increasingly deployed on real robots, where each predicted action is executed and each failure carries a safety cost. They reach high success rates on clean inputs but collapse under small adversarial perturbations. A 16/255 PGD attack on…

  238. Hugging Face Daily Papers TIER_1 English(EN) ·

    EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

    The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained policies consistently fall short of the reliability…

  239. arXiv cs.LG TIER_1 English(EN) · Jiapeng Zeng, Yogesh Prabhu, Zhanpeng Zeng, Michael A. Newton, Vikas Singh ·

    Empirical Bayes Conformal Prediction for Vision and Language Models

    arXiv:2605.23189v1 Announce Type: new Abstract: Conformal prediction (CP) gives distribution-free coverage for modern vision and language models, but it is often forced to make a ranking decision from a single unstable nonconformity score. Standard CP uses one realization, while …

  240. arXiv cs.CL TIER_1 English(EN) · Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu ·

    PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

    arXiv:2601.15224v2 Announce Type: replace-cross Abstract: Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear wheth…

  241. arXiv cs.AI TIER_1 English(EN) · Ke Ren, Ali Salamatian, Kieran Pattison, Cyrus Neary ·

    V-VLAPS: Value-Guided Planning for Vision-Language-Action Models

    arXiv:2601.00969v2 Announce Type: replace-cross Abstract: Vision-language-action (VLA) models provide strong action priors for robotic manipulation, but their reactive behavior can fail under distribution shift and long-horizon task structure. Recent VLA-guided planning methods i…

  242. arXiv cs.AI TIER_1 English(EN) · Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi ·

    LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation

    arXiv:2511.02239v2 Announce Type: replace-cross Abstract: Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks …

  243. arXiv cs.AI TIER_1 English(EN) · Dimitrios Damianos, Leon Voukoutis, Georgios Skyrianos, Vassilis Katsouros, Georgios Paraskevopoulos ·

    Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

    arXiv:2605.22902v1 Announce Type: cross Abstract: Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed to text remains poorly understood. Existing interpretability work on VLMs uses Sparse Autoencoders (SAEs), which …

  244. arXiv cs.AI TIER_1 English(EN) · Xiyang Wang, Xinlin Wang, Tingguang Zhou, Gong Chen, Xingtai Gui, Zhi Xu, Xiaolei Wu, Feiyang Tan, Hangning Zhou, Mu Yang ·

    ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

    arXiv:2605.23270v1 Announce Type: cross Abstract: Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction-aware temporal dependencies …

  245. arXiv cs.AI TIER_1 English(EN) · Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon ·

    Multimodal Distribution Matching for Vision-Language Dataset Distillation

    arXiv:2605.23482v1 Announce Type: cross Abstract: Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must pre…

  246. arXiv cs.AI TIER_1 English(EN) · Changhua Xu, En Yu, Junyu Xuan, Jie Lu ·

    VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

    arXiv:2602.07399v2 Announce Type: replace Abstract: Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonstrations remains unreliable. While fine-tuned VLA policies often produce semantically plau…

  247. arXiv cs.AI TIER_1 English(EN) · Zixuan Lan, Luzhe Sun, Matthew R. Walter, Jiawei Zhou ·

    Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

    arXiv:2605.22903v1 Announce Type: cross Abstract: Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by a …

  248. arXiv cs.AI TIER_1 English(EN) · Samar Fares, Klea Ziu, Toluwani Aremu, Nikita Durasov, Martin Takáč, Pascal Fua, Karthik Nandakumar, Ivan Laptev ·

    MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

    arXiv:2406.09250v3 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose Mi…

  249. arXiv cs.AI TIER_1 English(EN) · Ruofan Jin, Zaixi Zhang ·

    Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models

    arXiv:2605.22896v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by leveraging pre-trained vision-language representations. However, current VLA training methods suffer from two critical limitation…

  250. Hugging Face Daily Papers TIER_1 English(EN) ·

    Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation

    Navigating under non-stationary environment shifts poses a critical challenge for a Vision-and-Language Navigation (VLN) agent deployed in the wild. Yet, existing Test-Time Adaptation (TTA) methods for VLN largely treat online adaptation as transient, isolated updates, leading to…

  251. arXiv cs.LG TIER_1 English(EN) · Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja ·

    Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

    arXiv:2605.22679v1 Announce Type: cross Abstract: Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, whi…

  252. arXiv cs.AI TIER_1 English(EN) · Chen Li, Zhantao Yang, Fangyi Chen, Han Zhang, Anudeepsekhar Bolimera, Marios Savvides ·

    When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

    arXiv:2605.09860v3 Announce Type: replace Abstract: Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as commitment depth: the number of primitive actions executed open-loop betwe…

  253. arXiv cs.AI TIER_1 English(EN) · Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa ·

    Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

    arXiv:2604.11530v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) have revolutionized multi-modal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of process…

  254. arXiv cs.AI TIER_1 English(EN) · Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang, Hao Li, Xiaoxi Jiang, Guanjun Jiang ·

    Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    arXiv:2605.12374v3 Announce Type: replace-cross Abstract: Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an outpu…

  255. arXiv cs.AI TIER_1 English(EN) · Zhenyu Yu, Yangchen Zeng, Chunlei Meng, Guangzhen Yao, Shuigeng Zhou ·

    Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning

    arXiv:2605.20282v1 Announce Type: cross Abstract: Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge these claims by introducing Mirage, a representation-l…

  256. arXiv cs.AI TIER_1 English(EN) · Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann ·

    FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision-Language Generation

    arXiv:2605.20316v1 Announce Type: cross Abstract: Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision-language models derived from them recover bidirectional capability through…

  257. arXiv cs.AI TIER_1 English(EN) · Yulin Zhao, Yun Wang, Dehua Zheng, Borui Jiang, Zheng Zhang ·

    Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

    arXiv:2605.20950v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentional…

  258. arXiv cs.LG TIER_1 English(EN) · Jiayun Wang, Yu Wang, Weijie Gan, Zhenting Wang, Wei Wei ·

    UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

    arXiv:2605.21611v1 Announce Type: cross Abstract: We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encode…

  259. arXiv cs.CL TIER_1 English(EN) · Jiawei Zhou ·

    Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

    Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by a surprising observation that removing a substantial…

  260. arXiv cs.CL TIER_1 English(EN) · Georgios Paraskevopoulos ·

    Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

    Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed to text remains poorly understood. Existing interpretability work on VLMs uses Sparse Autoencoders (SAEs), which decompose static residual representations and miss…

  261. arXiv cs.LG TIER_1 English(EN) · Marek Śmieja ·

    Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

    Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduce…

  262. arXiv cs.AI TIER_1 English(EN) · Zhijun Meng ·

    Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

    While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures dur…

  263. Hugging Face Daily Papers TIER_1 English(EN) ·

    UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

    We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is …

  264. arXiv cs.AI TIER_1 English(EN) · Zheng Zhang ·

    Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

    Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly a…

  265. Hugging Face Daily Papers TIER_1 English(EN) ·

    Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

    Large Vision-Language Models demonstrate significant limitations in fine-grained spatio-temporal reasoning and tracking abilities when evaluated on a new furniture assembly benchmark.

  266. Hugging Face Daily Papers TIER_1 English(EN) ·

    From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

    Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay …

  267. arXiv cs.CL TIER_1 English(EN) · Yuyin Zhou ·

    From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

    Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay …

  268. Hugging Face Daily Papers TIER_1 English(EN) ·

    From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

    Staged training approaches that separately optimize visual perception, visual reasoning, and textual reasoning in vision-language models outperform unified training methods, leading to improved performance on visual reasoning tasks.

  269. Hugging Face Daily Papers TIER_1 English(EN) ·

    See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

    SWIM is a training approach that aligns vision and language representations for fine-grained object understanding using only textual prompts by addressing cross-modal attention misalignment through mask supervision and a new dataset.

  270. Hugging Face Daily Papers TIER_1 English(EN) ·

    ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

    Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-tri…

  271. arXiv cs.CL TIER_1 English(EN) · Pipei Huang ·

    Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

    Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. The…

  272. arXiv cs.CL TIER_1 English(EN) · Huan Liu ·

    To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

    The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, a…

  273. Hugging Face Daily Papers TIER_1 (CA) ·

    MMSkills: Towards Multimodal Skills for General Visual Agents

    Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: …

  274. arXiv cs.AI TIER_1 (CA) · Yong Yu ·

    MMSkills: Towards Multimodal Skills for General Visual Agents

    Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: …

  275. arXiv cs.CL TIER_1 English(EN) · Chang D. Yoo ·

    PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

    Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoni…

  276. arXiv cs.CL TIER_1 English(EN) · Hinrich Schütze ·

    DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

    Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construct…

  277. arXiv cs.AI TIER_1 English(EN) · Taesik Gong ·

    Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

    Vision-Language-Action (VLA) policies are typically evaluated as if the user had finished typing or speaking before the robot begins acting. In real deployment, however, users take several seconds to enter a request, leaving the policy idle for a substantial fraction of the inter…

  278. arXiv cs.CL TIER_1 English(EN) · Yong Li ·

    UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reas…

  279. arXiv cs.CL TIER_1 English(EN) · Wenxin Yu ·

    Allegory of the Cave: Measurement-Grounded Vision-Language Learning

    Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formul…

  280. Hugging Face Daily Papers TIER_1 English(EN) ·

    CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

    This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can …

  281. arXiv cs.AI TIER_1 English(EN) · Marcin Chlebus ·

    Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality

    Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of de…

  282. arXiv cs.AI TIER_1 English(EN) · Xingjun Ma ·

    ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free persp…

  283. arXiv cs.AI TIER_1 English(EN) · Mattia Rigotti ·

    GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision…

  284. arXiv cs.AI TIER_1 English(EN) · Zhenbo Xu ·

    RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

    Platform content moderation applies explicit policy rules and context-dependent conditions to decide whether user content is allowed, restricted, or removed. A correct moderation outcome must therefore depend on which rules a case activates, how those rules interact, and whether …

  285. arXiv cs.AI TIER_1 English(EN) · Zehao Deng, Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang ·

    Causal Probing for Internal Visual Representations in Multimodal Large Language Models

    arXiv:2605.05593v1 Announce Type: new Abstract: Despite the remarkable success of Multimodal Large Language Models (MLLMs) across diverse tasks, the internal mechanisms governing how they encode and ground distinct visual concepts remain poorly understood. To bridge this gap, we …

  286. arXiv cs.AI TIER_1 English(EN) · Yuxuan Wu, Guangming Wang, Zhiheng Yang, Maoqing Yao, Brian Sheil, Hesheng Wang ·

    Continually Evolving Skill Knowledge in Vision Language Action Model

    arXiv:2511.18085v3 Announce Type: replace-cross Abstract: Vision-language-action (VLA) models show promising knowledge accumulation ability from pretraining, yet continual learning in VLA remains challenging, especially for efficient adaptation. Existing continual imitation learn…

  287. arXiv cs.LG TIER_1 English(EN) · Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li ·

    VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

    arXiv:2605.05899v1 Announce Type: new Abstract: Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed fo…

  288. arXiv cs.LG TIER_1 English(EN) · Chenyu Huang, Peng Ye, Xudong Tan, Jinhan Mu, Shenghe Zheng, Li Shen, Tao Chen ·

    FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models

    arXiv:2601.21187v2 Announce Type: replace-cross Abstract: Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a co…

  289. arXiv cs.LG TIER_1 English(EN) · Shuyang Jiang, Nan Yu, Yiming Zhang, Zenghui Ding, Zhenyu Wu ·

    DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

    arXiv:2605.06592v1 Announce Type: cross Abstract: Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation…

  290. arXiv cs.LG TIER_1 English(EN) · Numan Saeed, Asif Hanif, Fadillah Adamsyah Maani, Hussain Alasmawi, Mohammad Yaqub ·

    DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression

    arXiv:2603.05421v3 Announce Type: replace-cross Abstract: Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude…

  291. arXiv cs.LG TIER_1 English(EN) · Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, Biqing Qi ·

    AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

    arXiv:2511.14148v2 Announce Type: replace-cross Abstract: Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and …

  292. arXiv cs.LG TIER_1 English(EN) · Binyu Zhao, Wei Zhang, Xingrui Yu, Zhaonian Zou, Ivor Tsang ·

    Advancing Analytic Class-Incremental Learning through Vision-Language Calibration

    arXiv:2602.13670v2 Announce Type: replace Abstract: Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. While analytic learning enables rapid, recursive closed-form updates, its efficacy …

  293. arXiv cs.LG TIER_1 English(EN) · Stéphane d'Ascoli, Jérémy Rapin, Yohann Benchetrit, Teon Brooks, Katelyn Begany, Joséphine Raugel, Hubert Banville, Jean-Rémi King ·

    A foundation model of vision, audition, and language for in-silico neuroscience

    arXiv:2605.04326v1 Announce Type: cross Abstract: Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, hence preventing a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, aud…

  294. arXiv cs.AI TIER_1 English(EN) · Shule Lu, Yujing Wang, Hainan Zhang, Xiaoshan Yang, Hongwei Zheng, Yongxin Tong, Changsheng Xu, Zhiming Zheng ·

    Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

    arXiv:2605.03426v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. Federated Learning mitigates this issue by e…

  295. arXiv cs.AI TIER_1 English(EN) · Yuanyuan Jia, Shunpu Tang, Qianqian Yang ·

    CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

    arXiv:2605.02218v1 Announce Type: new Abstract: Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demand…

  296. Hugging Face Daily Papers TIER_1 English(EN) ·

    A foundation model of vision, audition, and language for in-silico neuroscience

    Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, hence preventing a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, audio and language) foundation model capable of predi…

  297. arXiv cs.AI TIER_1 English(EN) · Magdalena Katharina Wekenborg ·

    Quantifying the human visual exposome with vision language models

    The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self reports, failing to capture the first person visual context…

  298. arXiv cs.AI TIER_1 English(EN) · Shengzhao Wen ·

    MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

    Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric evaluation. In this work, we introduce MHPR, a co…

  299. Hugging Face Daily Papers TIER_1 English(EN) ·

    Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

    Source-Free Domain Adaptation (SFDA) adapts source models to target domains without accessing source data, addressing privacy and transmission issues. However, existing methods still initialize from a source pre-trained model and thus are not truly source-free. Recent works have …

  300. Hugging Face Daily Papers TIER_1 English(EN) ·

    SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

    Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and Long-Wave Infrared (LWIR) remains largely unexplo…

  301. arXiv cs.AI TIER_1 English(EN) · Kenneth J. K. Ong ·

    The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

    arXiv:2604.27953v1 Announce Type: new Abstract: As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effects of visual priming on VLMs' c…

  302. arXiv cs.AI TIER_1 English(EN) · Ken Deng, Yifu Qiu, Yoni Kasten, Shay B. Cohen, Yftah Ziser ·

    Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

    arXiv:2601.22228v2 Announce Type: replace-cross Abstract: We study whether vision-language models (VLMs) can solve relative camera pose estimation (RCPE) from image pairs, a direct test of multi-view spatial reasoning. We cast RCPE as a discrete verbal classification task and int…

  303. arXiv cs.AI TIER_1 English(EN) · Santosh Vasa, Aditi Ramadwar, Jnana Rama Krishna Darabattula, Md Zafar Anwar, Stanislaw Antol, Andrei Vatavu, Thomas Monninger, Sihao Ding ·

    AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models

    arXiv:2507.12414v2 Announce Type: replace-cross Abstract: Training of autonomous driving systems requires extensive datasets with precise annotations to attain robust performance. Human annotations suffer from imperfections, and multiple iterations are often needed to produce hig…

  304. arXiv cs.CL TIER_1 English(EN) · Chahat Raj, Bowen Wei, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu ·

    VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

    arXiv:2505.22897v2 Announce Type: replace Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-…

  305. arXiv cs.CL TIER_1 English(EN) · Alice Plebe, Timothy Douglas, Diana Riazi, R. Maria del Rio-Chanona ·

    Images Amplify Misinformation Sharing in Vision-Language Models

    arXiv:2505.13302v2 Announce Type: replace Abstract: As language and vision-language models (VLMs) become central to information access and online interaction, concerns grow about their potential to amplify misinformation. Human studies show that images boost the perceived credibi…

  306. arXiv cs.CL TIER_1 English(EN) · Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu ·

    CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

    arXiv:2602.01785v2 Announce Type: replace Abstract: Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text…

  307. arXiv cs.AI TIER_1 English(EN) · Kaijun Zhou, Qiwei Chen, Da Peng, Zhiyang Li, Xijun Li, Jinyu Gu ·

    Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    arXiv:2604.24447v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs,…

  308. arXiv cs.CL TIER_1 English(EN) · Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata ·

    Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    arXiv:2604.24380v1 Announce Type: new Abstract: While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction tec…

  309. arXiv cs.CL TIER_1 English(EN) · Qidong Wang, Junjie Hu, Ming Jiang ·

    V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

    arXiv:2509.14837v2 Announce Type: replace Abstract: Recent advances in causal interpretability have extended from language models to vision-language models (VLMs), seeking to reveal their internal mechanisms through input interventions. While textual interventions often target se…

  310. arXiv cs.LG TIER_1 English(EN) · Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, Liyiming Ke ·

    RL Token: Bootstrapping Online RL with Vision-Language-Action Models

    arXiv:2604.23073v1 Announce Type: new Abstract: Vision-language-action (VLA) models can learn to perform diverse manipulation skills "out of the box," but achieving the precision and speed that real-world tasks demand requires further fine-tuning -- for example, via reinforcement…

  311. arXiv cs.AI TIER_1 English(EN) · Ziyao Wang, Bingying Wang, Hanrong Zhang, Tingting Du, Tianyang Chen, Guoheng Sun, Yexiao He, Zheyu Shen, Wanghao Ye, Ang Li ·

    Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

    arXiv:2604.23001v1 Announce Type: cross Abstract: Despite remarkable progress in Vision-Language-Action (VLA) models, a central bottleneck remains underexamined: the data infrastructure that underlies embodied learning. In this survey, we argue that future advances in VLA will …

  312. Hugging Face Daily Papers TIER_1 English(EN) ·

    CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade…

  313. Hugging Face Daily Papers TIER_1 English(EN) ·

    Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

    Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modalit…

  314. arXiv cs.AI TIER_1 English(EN) · Jinyu Gu ·

    Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, obscuring the trade-offs and opportunities offere…

  315. Hugging Face Daily Papers TIER_1 English(EN) ·

    Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from sm…

  316. arXiv cs.CL TIER_1 English(EN) · Zeynep Akata ·

    Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from sm…

  317. arXiv cs.CL TIER_1 English(EN) · Etha Tianze Hua, Tian Yun, Ellie Pavlick ·

    Source-Modality Monitoring in Vision-Language Models

    arXiv:2604.22038v1 Announce Type: new Abstract: We define and investigate source-modality monitoring -- the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of …

  318. Hugging Face Daily Papers TIER_1 English(EN) ·

    LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens,…

  319. arXiv cs.CL TIER_1 English(EN) · Ellie Pavlick ·

    Source-Modality Monitoring in Vision-Language Models

    We define and investigate source-modality monitoring -- the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate t…

  320. Hugging Face Daily Papers TIER_1 English(EN) ·

    Prototype-Based Test-Time Adaptation of Vision-Language Models

    Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce …

  321. Hugging Face Daily Papers TIER_1 English(EN) ·

    Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In th…

  322. Hugging Face Daily Papers TIER_1 English(EN) ·

    More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage

    Vision-Language Models (VLMs) excel at photorealistic generation, yet often struggle to represent abstract meaning such as idiomatic interpretations of noun compounds. To study whether high visual fidelity interferes with idiomatic compositionality under visual abstraction, we in…

  323. arXiv cs.CV TIER_1 English(EN) · Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang ·

    Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

    arXiv:2509.25787v5 Announce Type: replace Abstract: Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques hav…

  324. arXiv cs.CV TIER_1 English(EN) · Boyeong Im, Wooseok Lee, Yoojin Kwon, Hyung-Sin Kim ·

    Measurement Plasticity: Sensor-Level Adaptation for Vision-Language Models

    arXiv:2512.12571v3 Announce Type: replace Abstract: We propose Multi-View Physical-prompt (MVP) for Test-Time Adaptation (TTA), a forward-only framework that moves TTA from tokens to photons by treating the camera exposure triangle (i.e., ISO, shutter speed, and aperture) as phys…

  325. arXiv cs.CV TIER_1 English(EN) · Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür, Melkior Ornik ·

    Trajectory-Level Redirection Attacks on Vision-Language-Action Models

    arXiv:2606.12978v1 Announce Type: cross Abstract: Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control …

  326. arXiv cs.CV TIER_1 English(EN) · Weide Liu, Wei Zhou, Jun Liu, Ping Hu, Jun Cheng, Jungong Han, Weisi Lin ·

    Modality-Aware Feature Matching in Visual and Vision-Language Applications: A Comprehensive Survey

    arXiv:2507.22791v2 Announce Type: replace Abstract: Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, ex…

  327. arXiv cs.CV TIER_1 English(EN) · Aswanth Krishnan ·

    Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

    Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: [email protected] on …

  328. arXiv cs.CV TIER_1 English(EN) · Melkior Ornik ·

    Trajectory-Level Redirection Attacks on Vision-Language-Action Models

    Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning s…

  329. arXiv cs.CV TIER_1 English(EN) · Huaihai Lyu, Chaofan Chen, Yuheng Ji, Xiansheng Chen, Pengwei Wang, Shanghang Zhang, Changsheng Xu ·

    LAST: Bridging Vision-Language and Action Manifolds via Gromov-Wasserstein Alignment

    arXiv:2606.11221v1 Announce Type: new Abstract: We take a Gromov-Wasserstein perspective on Vision-Language-Action (VLA) learning, where the goal is to make the relational geometry of action representations compatible with the semantic geometry of VL embeddings. However, this ali…

  330. arXiv cs.CV TIER_1 English(EN) · Jin Yao, Dhruva Dixith Kurra, Tom Lampo, Zezhou Cheng, Danhua Guo, Burhan Yaman ·

    VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

    arXiv:2606.12396v1 Announce Type: new Abstract: Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D fou…

  331. arXiv cs.CV TIER_1 English(EN) · Seokju Cho, Abhishek Badki, Hang Su, Jindong Jiang, Ziyao Zeng, Seungryong Kim, Sifei Liu, Orazio Gallo ·

    4DP-QA: Scalable QA for 4D Perception in Vision Language Models

    arXiv:2606.11568v1 Announce Type: new Abstract: Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs…

  332. arXiv cs.CV TIER_1 English(EN) · Burhan Yaman ·

    VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

    Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures …

  333. arXiv cs.CV TIER_1 English(EN) · Jing Ma ·

    From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

    Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision-language models (VLMs) remain brittle at s…

  334. arXiv cs.CV TIER_1 English(EN) · Yixuan Li, Yuhui Chen, Mingcai Zhou, Haoran Li, Zhengtao Zhang, Dongbin Zhao ·

    QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models

    arXiv:2510.14836v3 Announce Type: replace Abstract: Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential …

  335. arXiv cs.CV TIER_1 English(EN) · Yulin Chen, Zhihang Zhong, Yuenan Hou ·

    Segment and Select: Vision-Language Segmentation in 3D Scenarios

    arXiv:2606.10594v1 Announce Type: new Abstract: 3D vision-language segmentation aims to segment target objects in 3D scenarios according to the linguistic instructions and visual observations. Prior art heavily relies on the coarse superpoint representation to reduce the computat…

  336. arXiv cs.CV TIER_1 English(EN) · Yuenan Hou ·

    Segment and Select: Vision-Language Segmentation in 3D Scenarios

    3D vision-language segmentation aims to segment target objects in 3D scenarios according to the linguistic instructions and visual observations. Prior art heavily relies on the coarse superpoint representation to reduce the computation complexity, which suffers from poor segmenta…

  337. arXiv cs.CV TIER_1 English(EN) · Hao Shi, Weiye Li, Bin Xie, Yulin Wang, Renping Zhou, Tiancai Wang, Xiangyu Zhang, Ping Luo, Gao Huang ·

    MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

    arXiv:2606.09827v1 Announce Type: cross Abstract: Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and there…

  338. arXiv cs.CV TIER_1 English(EN) · Leiyu Wang, Zhaofengnian Wang, Xueqi Li, Luoyi Fan, Cewu Lu, Nanyang Ye ·

    Scaling by Diversified Experience for Vision-Language-Action Models

    arXiv:2606.09009v1 Announce Type: new Abstract: Vision-Language-Action models face significant challenges in real-world deployment due to the entanglement of high-level reasoning with low-level control, and the instability of policy optimization. In this paper, we introduce SyVLA…

  339. arXiv cs.CV TIER_1 English(EN) · Yizheng Sun, Mochuan Zhan, Yanan Ma, Jia Tong See, Yifan Wang, Ziyi Wang, Hao Li, Yang Cui, Wenhao Cai, Jingyu Sun, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun ·

    Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

    arXiv:2606.08894v1 Announce Type: new Abstract: Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable real-world application requires handling visual inputs that are messier than clean, curated benchmarks. Existing works main…

  340. arXiv cs.CV TIER_1 English(EN) · George Ling, Lijin Yang, Hao Yang, Zhongzhan Huang ·

    BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving

    arXiv:2606.08684v1 Announce Type: new Abstract: We present BLUE, a minimal method for better language use in vision-language-action (VLA) models for autonomous driving (AD). Through extensive analysis, we reveal that language matters on only a small fraction of routes, but on tho…

  341. arXiv cs.CV TIER_1 English(EN) · Arohan Agate ·

    Vision-Language Asymmetry in Bistable Image Captioning

    arXiv:2606.08031v1 Announce Type: new Abstract: Wittgenstein's duck-rabbit poses a question for vision-language models: when a model captions an ambiguous image, where in the model is the commitment to one aspect made? We address this with a 3,320-generation behavioral baseline o…

  342. arXiv cs.CV TIER_1 English(EN) · Sung-Wook Lee, Xuhui Kang, Yen-Ling Kuo ·

    TBD-VLA: Temporal Block Diffusion Vision Language Action Model

    arXiv:2606.07895v1 Announce Type: new Abstract: Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm in…

  343. arXiv cs.CV TIER_1 English(EN) · Lexin Wang, Shenghua Liu, Yiwei Wang, Jiafeng Guo, Xueqi Cheng ·

    Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models

    arXiv:2606.07641v1 Announce Type: new Abstract: Can vision-language models predict what a 180° rotation would reveal from the original image alone? We study this ability through Rotated-Outcome Prediction: given an original image, a model must answer what would be seen or re…

  344. arXiv cs.CV TIER_1 English(EN) · Clara Petrova, Zhuo Chen, Marin Soljačić ·

    TraversalBench: Challenging Paths to Follow for Vision Language Models

    arXiv:2604.10999v2 Announce Type: replace Abstract: Vision-language models (VLMs) perform strongly on multimodal benchmarks, but their ability to follow complex visual paths remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal.…

  345. arXiv cs.CV TIER_1 English(EN) · Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Zichen Wen, Bowen Fang, Tao Yu, Xiangnan Wu, Qisen Ma, Kai Wang, Ziheng He, Yingda Li, Zhengbo Zhang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang ·

    UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

    arXiv:2602.18020v2 Announce Type: replace Abstract: Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance …

  346. arXiv cs.CV TIER_1 English(EN) · Gao Huang ·

    MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

    Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally depend…

  347. arXiv cs.CV TIER_1 English(EN) · Qian Zhang, Michal Golovanevsky, Fulvio Domini, James Tompkin ·

    Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

    arXiv:2606.06714v1 Announce Type: new Abstract: Human perception of surface slant from texture exhibits systematic, graded biases that emerge reliably in psychophysical experiments. Prior work showed that unsupervised CNNs reproduce several human-like biases, while supervised CNN…

  348. arXiv cs.CV TIER_1 English(EN) · Zikai Zhang, Hubert P. H. Shum, Toby P. Breckon ·

    VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

    arXiv:2606.07338v1 Announce Type: new Abstract: Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present Ver…

  349. arXiv cs.CV TIER_1 English(EN) · Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang, Siyuan Huang ·

    LARA: Latent Action Representation Alignment for Vision-Language-Action Models

    arXiv:2606.07100v1 Announce Type: new Abstract: Visual-language action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world …

  350. arXiv cs.CV TIER_1 English(EN) · Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao ·

    MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models

    arXiv:2606.06760v1 Announce Type: new Abstract: Medical large vision-language models (Med-LVLMs) have recently achieved remarkable progress in vision-language comprehension and medical image segmentation. However, existing models still struggle to unify these two capabilities, wh…

  351. arXiv cs.CV TIER_1 English(EN) · Toby P. Breckon ·

    VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

    Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-or…

  352. arXiv cs.CV TIER_1 English(EN) · Liqiang Nie ·

    Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

    Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, and …

  353. arXiv cs.CV TIER_1 English(EN) · Siyuan Huang ·

    LARA: Latent Action Representation Alignment for Vision-Language-Action Models

    Visual-language action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model l…

  354. arXiv cs.CV TIER_1 Italiano(IT) · Yisen Wang ·

    Diagnosing Visual Ignorance in Vision-Language Models

    Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark evaluation remain insufficiently understood. In t…

  355. arXiv cs.CV TIER_1 English(EN) · Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma ·

    Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

    arXiv:2606.06002v1 Announce Type: new Abstract: Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods …

  356. arXiv cs.CV TIER_1 English(EN) · XiuYu Zhang, Junfeng Fang, Zhenkai Liang ·

    Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

    arXiv:2606.05753v1 Announce Type: new Abstract: Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similari…

  357. arXiv cs.CV TIER_1 English(EN) · Pengcheng Zheng, Chaoning Zhang, Ya Wen, Wang Liu, Qigan Sun, Jiarong Mo, Jiaquan Zhang, Jewon Lee, Tae-Ho Kim, Kuien Liu, Tianyu Li, Caiyan Qin, Yang Yang ·

    Topology-Aware Layer Pruning for Large Vision-Language Models

    arXiv:2604.16502v2 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning, while recent extensions that incorporate visual inputs enable them to process multimodal information. Despite th…

  358. arXiv cs.CV TIER_1 English(EN) · Walid Bousselham, Hilde Kuehne, Cordelia Schmid ·

    VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

    arXiv:2510.23497v3 Announce Type: replace Abstract: Training vision-language models (VLMs) for complex reasoning remains a challenging task, i.a. due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, b…

  359. arXiv cs.CV TIER_1 English(EN) · Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen ·

    AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

    arXiv:2606.06155v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces a…

  360. arXiv cs.CV TIER_1 English(EN) · Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu, Juncheng Hu, Xikun Zhang, Renqiang Luo ·

    Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

    arXiv:2606.05702v1 Announce Type: cross Abstract: Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introdu…

  361. arXiv cs.CV TIER_1 English(EN) · Zhipeng Chen ·

    MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

    The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily du…

  362. arXiv cs.CV TIER_1 English(EN) · Yingcong Chen ·

    AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

    Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the lea…

  363. arXiv cs.CV TIER_1 English(EN) · Huadong Ma ·

    Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

    Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mech…

  364. arXiv cs.CV TIER_1 English(EN) · Yin Li ·

    DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

    Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continu…

  365. arXiv cs.CV TIER_1 English(EN) · Zhenkai Liang ·

    Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

    Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the trai…

  366. arXiv cs.CV TIER_1 English(EN) · Xipeng Qiu ·

    Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

    Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and…

  367. arXiv cs.CV TIER_1 English(EN) · Renqiang Luo ·

    Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

    Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to eval…

  368. arXiv cs.CV TIER_1 English(EN) · Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, Salman Khan ·

    Towards Evaluating the Robustness of Visual State Space Models

    arXiv:2406.09407v3 Announce Type: replace Abstract: Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capt…

  369. arXiv cs.CV TIER_1 English(EN) · Jiaxin Shi, Xidong Zhang, Fucai Zhu, Zhe Li, Siyu Zhu, Weihao Yuan ·

    3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

    arXiv:2606.04436v1 Announce Type: new Abstract: We propose a 3D-thinking-guided co-training framework that enables vision-language-action (VLA) models to perform 3D spatial reasoning implicitly during action prediction. Our core insight is that 3D geometry perception and 3D spati…

  370. arXiv cs.CV TIER_1 English(EN) · Sukriti Paul, Arpit Bansal, Tom Goldstein ·

    ChannelTok: Efficient Flexible-Length Vision Tokenization

    arXiv:2606.04461v1 Announce Type: new Abstract: Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, l…

  371. arXiv cs.CV TIER_1 English(EN) · Jaden Park, Valentin Deschaintre, Jason Kuen, Kangning Liu, Iliyan Georgiev, Krishna Kumar Singh, Yong Jae Lee, Michael Fischer ·

    MAOAM: Unified Object and Material Selection with Vision-Language Models

    arXiv:2606.04880v1 Announce Type: new Abstract: Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should suppo…

  372. arXiv cs.CV TIER_1 English(EN) · Yu Zhu, Yongkang Li, Wenjie Zhu, Haoyi Jiang, Wenyu Liu, Wei Yang, Bin Li, Xinggang Wang ·

    Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning

    arXiv:2606.04986v1 Announce Type: new Abstract: Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, hig…

  373. arXiv cs.CV TIER_1 English(EN) · Xinggang Wang ·

    Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning

    Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations re…

  374. arXiv cs.CV TIER_1 English(EN) · Tom Goldstein ·

    ChannelTok: Efficient Flexible-Length Vision Tokenization

    Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-lengt…

  375. arXiv cs.CV TIER_1 English(EN) · Weihao Yuan ·

    3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

    We propose a 3D-thinking-guided co-training framework that enables vision-language-action (VLA) models to perform 3D spatial reasoning implicitly during action prediction. Our core insight is that 3D geometry perception and 3D spatial reasoning are distinct capabilities that can …

  376. arXiv cs.CV TIER_1 English(EN) · Trevor Darrell ·

    Stateful Visual Encoders for Vision-Language Models

    Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains state…

  377. arXiv cs.CV TIER_1 English(EN) · Jizhihui Liu, Ruizi Han, Miao Zhang, Rui Shao, Xuebo Liu, Weili Guan, Yaowei Wang ·

    TGV-KV: Text-Grounded KV Eviction for Vision-Language Models

    arXiv:2606.03075v1 Announce Type: new Abstract: Vision-Language Models (VLMs) inherit the auto-regressive generation paradigm and cache the keys and values (KV) of all previous tokens to accelerate inference, resulting in memory consumption that scales linearly with context lengt…

  378. arXiv cs.CV TIER_1 English(EN) · Borong Zhang, Jiahao Li, Jiachen Shen, Yuhao Zhang, Yishuai Cai, Yuanpei Chen, Juntao Dai, Jiaming Ji, Yaodong Yang ·

    VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

    arXiv:2512.22539v2 Announce Type: replace-cross Abstract: While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehe…

  379. arXiv cs.CV TIER_1 English(EN) · Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan ·

    ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

    arXiv:2411.15851v2 Announce Type: replace Abstract: While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attr…

  380. arXiv cs.CV TIER_1 English(EN) · S Divakar Bhat, Toshihiko Yamasaki ·

    Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models

    arXiv:2606.02742v1 Announce Type: new Abstract: Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints ref…

  381. arXiv cs.CV TIER_1 English(EN) · Wei Yang ·

    PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

    Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these feature conflicts, we introduce PRISM, a novel …

  382. arXiv cs.CV TIER_1 English(EN) · Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang ·

    Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

    arXiv:2606.01621v1 Announce Type: new Abstract: Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is am…

  383. arXiv cs.CV TIER_1 English(EN) · Tingjun Dai, Mingfei Han, Tingwen Du, Zhiheng Liu, Zihao Zhang, Zhihui Li, Salman Khan, Jun Yu, Xiaojun Chang ·

    See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

    arXiv:2603.09292v2 Announce Type: replace-cross Abstract: Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediat…

  384. arXiv cs.CV TIER_1 English(EN) · Xiang Fang, Wanlong Fang, Changshuo Wang ·

    Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

    arXiv:2606.01565v1 Announce Type: cross Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indo…

  385. arXiv cs.CV TIER_1 English(EN) · Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai ·

    Scaling Pre-training to One Hundred Billion Data for Vision Language Models

    arXiv:2502.07617v2 Announce Type: replace Abstract: We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western…

  386. arXiv cs.CV TIER_1 English(EN) · Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu, Haoyun Jiang, Liu Yang, Qiang Hu, Guangtao Zhai, Xiaoyun Zhang ·

    LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

    arXiv:2606.02535v1 Announce Type: new Abstract: Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studie…

  387. arXiv cs.CV TIER_1 English(EN) · Stephen James Krol, Jon McCormack ·

    Evolving to the Aesthetics of a Vision-Language Model

    arXiv:2606.00112v1 Announce Type: cross Abstract: Evolutionary systems have demonstrated remarkable results in creative domains, with recent applications in generative typography, design, and music. However, an open problem remains in designing fitness functions that effectively …

  388. arXiv cs.CV TIER_1 English(EN) · Xiaoyun Zhang ·

    LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

    Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To address this gap, we introduce LL-…

  389. arXiv cs.CV TIER_1 English(EN) · Xiang Fang, Wanlong Fang ·

    SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling

    arXiv:2605.30750v1 Announce Type: new Abstract: In the era of Large Video-Language Models (LVLMs), the computational necessity of sparse frame sampling creates a fundamental "temporal gap", rendering models blind to critical causal transitions. Existing solutions relying on gen…

  390. arXiv cs.CV TIER_1 English(EN) · Zihu Wang, Karthik Somayaji N. S, Peng Li ·

    ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models

    arXiv:2605.30587v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning has significantly improved the reasoning ability of large vision-language models (LVLMs) by verbalizing intermediate reasoning steps in natural language. However, such discrete textual rationales are…

  391. arXiv cs.CV TIER_1 English(EN) · Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang Zhao ·

    DriveMA: Driving Vision-Language-Action Models with verifiable Meta-Actions

    arXiv:2605.31271v1 Announce Type: new Abstract: Driving Vision-Language-Action Models (Driving VLAs) aim to use language to improve end-to-end planning, but the language-action gap limits this promise. We propose DriveMA, a Driving VLA framework built on verifiable meta-actions, …

  392. arXiv cs.CV TIER_1 English(EN) · Runze Cheng, Yao Sun, Ahmad Taha, Xuesong Liu, David Flynn, Muhammad Ali Imran ·

    A Survey on Semantic Communication for Vision: Categories, Frameworks, Enabling Techniques, and Applications

    arXiv:2601.22202v2 Announce Type: replace-cross Abstract: Semantic communication (SemCom) emerges as a transformative paradigm for traffic-intensive visual data transmission, shifting focus from raw data to meaningful content transmission and relieving the increasing pressure on …

  393. arXiv cs.CV TIER_1 English(EN) · Olaf Dünkel, Basavaraj Sunagad, Haoran Wang, David T. Hoffmann, Christian Theobalt, Adam Kortylewski ·

    SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

    arXiv:2605.31597v1 Announce Type: new Abstract: Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing…

  394. arXiv cs.CV TIER_1 English(EN) · Yanshu Li, Jiaqian Li, Kuai Yu, Xi Xiao, Dongfang Liu, Tianyang Wang, Ruixiang Tang ·

    Personalize Your Large Vision-language Models With In-context Prompt Tuning

    arXiv:2605.31513v1 Announce Type: new Abstract: Large vision-language models (LVLMs) have demonstrated strong general multimodal capability and are increasingly deployed in downstream systems. This trend has driven growing interest in LVLM personalization, which aims to enable mo…

  395. arXiv cs.CV TIER_1 English(EN) · Adam Kortylewski ·

    SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

    Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across inst…

  396. arXiv cs.CV TIER_1 English(EN) · Ruixiang Tang ·

    Personalize Your Large Vision-language Models With In-context Prompt Tuning

    Large vision-language models (LVLMs) have demonstrated strong general multimodal capability and are increasingly deployed in downstream systems. This trend has driven growing interest in LVLM personalization, which aims to enable models to quickly and effectively learn out-of-dis…

  397. arXiv cs.CV TIER_1 English(EN) · Hang Zhao ·

    DriveMA: Driving Vision-Language-Action Models with verifiable Meta-Actions

    Driving Vision-Language-Action Models (Driving VLAs) aim to use language to improve end-to-end planning, but the language-action gap limits this promise. We propose DriveMA, a Driving VLA framework built on verifiable meta-actions, which summarize future ego motion into compact l…

  398. arXiv cs.CV TIER_1 English(EN) · An-Chieh Cheng, Yang Fu, Yatai Ji, Ligeng Zhu, Guanqi Zhan, Zhuoyang Zhang, Zhaojing Yang, Song Han, Yao Lu, Pavlo Molchanov, Vidya Nariyambut Murali, Jan Kautz, Xiaolong Wang, Hongxu Yin, Sifei Liu ·

    Grounded 3D-Aware Spatial Vision-Language Modeling

    arXiv:2605.30307v1 Announce Type: new Abstract: We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an impli…

  399. arXiv cs.CV TIER_1 English(EN) · Zhongyu Xia, Yousen Tang, Bingqing Wei, Yongtao Wang ·

    3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding

    arXiv:2605.29416v1 Announce Type: cross Abstract: Vision-Language-Action models have achieved remarkable progress in robotic manipulation, yet they suffer from a critical limitation: a lack of 3D scene understanding. This deficiency manifests as three intertwined challenges: weak…

  400. arXiv cs.CV TIER_1 English(EN) · John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin ·

    Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

    arXiv:2510.27607v3 Announce Type: replace Abstract: Augmenting vision-language-action models (VLAs) with world models is promising for robotic policy learning but faces challenges in jointly predicting states and actions due to the modality gap. To address this, we propose DUal-S…

  401. arXiv cs.CV TIER_1 English(EN) · Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang ·

    SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation

    arXiv:2605.29662v1 Announce Type: new Abstract: Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on sh…

  402. arXiv cs.CV TIER_1 English(EN) · Kyujin Lee, Injae Kim, Jihwan Park, Yejun Ju, Minseok Joo, Hyunwoo J. Kim ·

    Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning

    arXiv:2605.29577v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM…

  403. arXiv cs.CV TIER_1 English(EN) · Fengshun Wang, Zhengbo Zhang, Zhigang Tu ·

    Masked Diffusion Vision-Language Models for Temporal Action Localization

    arXiv:2605.29858v1 Announce Type: new Abstract: Temporal action localization (TAL) requires recognizing the target event and localizing its start and end times precisely in untrimmed videos. Recent vision-language formulations improve semantic reasoning and support language-condi…

  404. arXiv cs.CV TIER_1 English(EN) · Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park ·

    Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

    arXiv:2605.30161v1 Announce Type: new Abstract: Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce …

  405. arXiv cs.CV TIER_1 English(EN) · Sifei Liu ·

    Grounded 3D-Aware Spatial Vision-Language Modeling

    We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity m…

  406. arXiv cs.CV TIER_1 English(EN) · Jaesik Park ·

    Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

    Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that c…

  407. arXiv cs.CV TIER_1 English(EN) · Yueting Zhuang ·

    VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

    Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive tex…

  408. arXiv cs.CV TIER_1 English(EN) · Zhigang Tu ·

    Masked Diffusion Vision-Language Models for Temporal Action Localization

    Temporal action localization (TAL) requires recognizing the target event and localizing its start and end times precisely in untrimmed videos. Recent vision-language formulations improve semantic reasoning and support language-conditioned outputs, but their autoregressive decoder…

  409. arXiv cs.CV TIER_1 English(EN) · Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, Wei Ji ·

    Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

    arXiv:2605.27894v1 Announce Type: new Abstract: Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are compl…

  410. arXiv cs.CV TIER_1 English(EN) · Landi He, Mingde Yao, Shawn Young, Lijian Xu ·

    Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

    arXiv:2605.28051v1 Announce Type: new Abstract: Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, …

  411. arXiv cs.CV TIER_1 English(EN) · Haiwen Diao, Jiahao Wang, Penghao Wu, Yuhao Dong, Yuwei Niu, Yue Zhu, Zhongang Cai, Weichen Fan, Linjun Dai, Silei Wu, Xuanyu Zheng, Mingxuan Li, Yuanhan Zhang, Bo Li, Hanming Deng, Huchuan Lu, Quan Wang, Lei Yang, Lewei Lu, Dahua Lin, Ziwei Liu ·

    From Pixels to Words -- Towards Native One-Vision Models at Scale

    arXiv:2605.28820v1 Announce Type: new Abstract: Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters ea…

  412. arXiv cs.CV TIER_1 English(EN) · Ruiyan Gong, Meisheng Zhang, Yuxiang Zhao, Mingchao Sun, Yanfen Shen, Zedong Chu, Zhining Gu, Wei Guo, Xiaolong Cheng, Qiming Li, Kangning Niu, Yanqing Zhu, Xiaolong Wu, Tianlun Li, Mu Xu ·

    POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

    arXiv:2605.28237v1 Announce Type: cross Abstract: Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks of POI-goal navigation often …

  413. arXiv cs.CV TIER_1 English(EN) · Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie ·

    CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

    arXiv:2508.21046v3 Announce Type: replace Abstract: Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a …

  414. arXiv cs.CV TIER_1 English(EN) · Lingyu Xiong, Jinjin Shi, Xuran Xu, Cong Luo, Runyu Shi, Ying Huang ·

    SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation

    arXiv:2605.27893v1 Announce Type: new Abstract: Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient F…

  415. arXiv cs.CV TIER_1 English(EN) · Corentin Seutin, Mohamed Amine Ettaki, Michaël Clément, Pierrick Coupé, Rémi Giraud ·

    Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

    arXiv:2605.28348v1 Announce Type: new Abstract: Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about …

  416. arXiv cs.CV TIER_1 English(EN) · Hongyu Ding, Sizhuo Zhang, Ziming Xu, Jinwen Guo, Hongxiu Liu, Xingzhi Cheng, Zixuan Chen, Haifei Qi, Duo Wang, Hao Xu, Jieqi Shi, Yifan Zhang, Jing Huo, Jian Cheng, Yang Gao, Jiebo Luo ·

    Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

    arXiv:2605.27582v1 Announce Type: cross Abstract: Embodied navigation requires an agent to map language and visual observations to a stream of spatial actions that drive a real robot through environments it has never seen. The dominant approach has been to scale vision-language-a…

  417. arXiv cs.CV TIER_1 English(EN) · Ziwei Liu ·

    From Pixels to Words -- Towards Native One-Vision Models at Scale

    Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native…

  418. arXiv cs.CV TIER_1 English(EN) · Peng Lu ·

    Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

    Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. Prior quantization efforts offer only partial solut…

  419. arXiv cs.CV TIER_1 English(EN) · Mohammad Rashedul Hasan ·

    The Abstraction Gap in Vision-Language Causal Reasoning

    Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic qual…

  420. arXiv cs.CV TIER_1 English(EN) · Rémi Giraud ·

    Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

    Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about intrinsic visual properties such as shape, geome…

  421. arXiv cs.CV TIER_1 English(EN) · Mu Xu ·

    POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

    Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks of POI-goal navigation often suffer from coarse granularity or significant sim-…

  422. arXiv cs.CV TIER_1 English(EN) · Lijian Xu ·

    Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

    Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is driven by surrogate gradient…

  423. arXiv cs.CV TIER_1 English(EN) · Jianzhe Gao, Rui Liu, Wenguan Wang ·

    3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

    arXiv:2605.26500v1 Announce Type: new Abstract: Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works equip agents with various scene repres…

  424. arXiv cs.CV TIER_1 English(EN) · Senyuan Shi, Hao Tan, Zichang Tan, Shuhan Feng, Ajian Liu, Sergio Escalera, Jun Wan ·

    HydraPrompt: An Adaptive and Asymmetric Framework of Vision-Language Models for Synthetic Image Detection

    arXiv:2605.26421v1 Announce Type: new Abstract: The rapid evolution of generative models has precipitated a proliferation of fabricated content, posing significant challenges to existing Synthetic Image Detection (SID) methods. Capitalizing on advancements in vision-language mode…

  425. arXiv stat.ML TIER_1 English(EN) · Qilin Liao, Anamika Lochab, Ruqi Zhang ·

    VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

    arXiv:2510.17759v2 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brit…

  426. arXiv cs.CV TIER_1 English(EN) · Yujie Lin, Kaidi Jia, Jiayao Ma, Chengyi Yang, Jinsong Su ·

    On the Robustness of Machine Unlearning for Vision-Language Models

    arXiv:2605.26992v1 Announce Type: new Abstract: Vision-language models (VLMs) may memorize undesirable information from training data, motivating growing interest in machine unlearning. In this work, we present the first systematic survey and robustness analysis of VLM unlearning…

  427. arXiv cs.CV TIER_1 English(EN) · Jianzhe Gao, Rui Liu, Yuxuan Xu, Tongtong Cao, Yingxue Zhang, Zhanguang Zhang, Sida Peng, Yi Yang, Wenguan Wang ·

    Uncertainty-Aware Gaussian Map for Vision-Language Navigation

    arXiv:2605.26503v1 Announce Type: new Abstract: Vision-Language Navigation (VLN) requires an agent to navigate 3D environments following natural language instructions. During navigation, existing agents commonly encounter perceptual uncertainty, such as insufficient evidence for …

  428. arXiv cs.CV TIER_1 English(EN) · Guixian Xu, Yide Liang, Zeli Su, Xuexian Song, Ziyin Zhang, Yushuang Dong, Ting Zhang, Xu Han ·

    FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

    arXiv:2605.26601v1 Announce Type: new Abstract: Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a com…

  429. arXiv cs.CV TIER_1 English(EN) · Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Bing Wang, Zhixing Tan ·

    DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding

    arXiv:2605.26656v1 Announce Type: new Abstract: Multimodal large language models are typically trained end-to-end to predict ground-truth answers, yet supervision signals are applied exclusively to text tokens. Visual tokens, the core carriers of visual information, are optimized…

  430. arXiv cs.CV TIER_1 English(EN) · Joseph Hoche, David Brellmann, Gianni Franchi ·

    Leveraging Visual Signals for Robust Token-Level Uncertainty in Vision-Language Generation

    arXiv:2605.27136v1 Announce Type: new Abstract: Uncertainty quantification (UQ) remains a critical challenge in Large Vision Language Models (LVLMs) for reliable predictions and real-world deployment. However, most existing methods are adapted from the LLM literature and primaril…

  431. arXiv cs.CV TIER_1 English(EN) · Aaron Branson Cigres Li, Zhaowei Wang, Yu Zhao, Yiming Du, Haobo Li, Xiyu Ren, Ginny Wong, Simon See, Lishu Luo, Haodong Duan, Pasquale Minervini, Yangqiu Song ·

    Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

    arXiv:2605.27243v1 Announce Type: new Abstract: Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images.…

  432. arXiv cs.CV TIER_1 English(EN) · Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu ·

    RISE: Reliable Improvement in Self-Evolving Vision-Language Models

    arXiv:2605.20914v2 Announce Type: replace Abstract: Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to …

  433. arXiv cs.CV TIER_1 English(EN) · Yangqiu Song ·

    Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

    Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior using retr…

  434. arXiv cs.CV TIER_1 English(EN) · Gianni Franchi ·

    Leveraging Visual Signals for Robust Token-Level Uncertainty in Vision-Language Generation

    Uncertainty quantification (UQ) remains a critical challenge in Large Vision Language Models (LVLMs) for reliable predictions and real-world deployment. However, most existing methods are adapted from the LLM literature and primarily focus on the language modality, leaving the co…

  435. arXiv cs.CV TIER_1 English(EN) · Jinsong Su ·

    On the Robustness of Machine Unlearning for Vision-Language Models

    Vision-language models (VLMs) may memorize undesirable information from training data, motivating growing interest in machine unlearning. In this work, we present the first systematic survey and robustness analysis of VLM unlearning. We provide a comprehensive taxonomy and review…

  436. arXiv cs.CV TIER_1 English(EN) · Zhixing Tan ·

    DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding

    Multimodal large language models are typically trained end-to-end to predict ground-truth answers, yet supervision signals are applied exclusively to text tokens. Visual tokens, the core carriers of visual information, are optimized only implicitly as part of the context, leading…

  437. arXiv cs.CV TIER_1 English(EN) · Kaixiang Chen, Pengfei Fang, Hui Xue ·

    MAIL++: Multi-Modal Bi-directional Agent Layer for Vision-Language Models

    arXiv:2605.25479v1 Announce Type: new Abstract: Adapting large vision-language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine-tuning is computationally prohibitive and prone to overfitting in low-data regimes. Parameter-efficient fine-tuning (PEF…

  438. arXiv cs.CV TIER_1 English(EN) · Longteng Guo, Yifan Wang, Pengkang Huo, Tailai Chen, Yuze Wu, Jing Liu, Xinxin Zhu ·

    Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

    arXiv:2605.25364v1 Announce Type: new Abstract: Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce …

  439. arXiv cs.CV TIER_1 English(EN) · Alexey Kravets, Da Li, Chuan Li, Da Chen, Vinay P. Namboodiri ·

    Interpretability Transfer from Language to Vision via Sparse Autoencoders

    arXiv:2605.24946v1 Announce Type: new Abstract: Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we in…

  440. arXiv cs.CV TIER_1 English(EN) · Wenhui Chu ·

    RepSAM: Bridging Foundation Models to Robotic Vision via Representation-Guided Adaptation

    arXiv:2605.25495v1 Announce Type: cross Abstract: Robotic perception in unstructured environments remains challenging despite the zero-shot capabilities of foundation models such as SAM. This work attributes performance degradation to non-uniform representation shifts across tran…

  441. arXiv cs.CV TIER_1 English(EN) · Yurou Yang, Muyuan Lin, Roberto Martin-Martin, Martin Labrie, Shreekant Gayaka, Cheng-Hao Kuo, Luca Carlone ·

    Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

    arXiv:2605.24642v1 Announce Type: new Abstract: Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved pe…

  442. arXiv cs.CV TIER_1 English(EN) · Xiao Liu, Jiaxiang Liu, Boci Peng, Boren Hu, Yusong Wang, Xiwen Chen, Prayag Tiwari, Liming Zhang, Mingkun Xu ·

    Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models

    arXiv:2605.25922v1 Announce Type: new Abstract: Vision Language Models adapt well to downstream tasks but are highly vulnerable to adversarial perturbations that disrupt cross-modal semantic alignment. Existing defenses are largely unidirectional or structural, failing to exploit…

  443. arXiv cs.CV TIER_1 English(EN) · Xuan Wang, Yinan Wu, Haoran Duan, Jungong Han ·

    QuoVLA: Quotient Space for Vision-Language-Action Models

    arXiv:2605.24890v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models commonly adapt pretrained Vision-Language Models (VLMs) to robot control by mapping visual observations and language instructions to continuous actions. Existing approaches typically take an actio…

  444. arXiv cs.CV TIER_1 English(EN) · Chris Ge, Rohit Gandikota, Antonio Torralba, Tamar Rott Shaham ·

    Vision-Language Binding in In-Context Image Generation

    arXiv:2605.24624v1 Announce Type: new Abstract: In-context image generation models such as FLUX.2 take a text prompt and an optional reference image as visual conditioning for the output. Internally, all three inputs -- text, reference image, and the noise tokens -- are concatena…

  445. arXiv cs.CV TIER_1 English(EN) · Weikang Qiu, Huashuo Lei, Tinglin Huang, Rex Ying ·

    Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement

    arXiv:2602.03983v3 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observation…

  446. arXiv cs.CV TIER_1 English(EN) · Jaeha Choi, Jin Won Lee, Siwoo You, Jangho Lee ·

    It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models

    arXiv:2603.08011v2 Announce Type: replace Abstract: Advances in vision-language models (VLMs) have achieved remarkable success on complex multimodal reasoning tasks, leading to the assumption that they should also excel at reading analog clocks. However, contrary to this expectat…

  447. arXiv cs.CV TIER_1 English(EN) · Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong ·

    Spatial-aware Vision Language Model for Autonomous Driving

    arXiv:2512.24331v2 Announce Type: replace Abstract: While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decis…

  448. arXiv cs.CV TIER_1 English(EN) · Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, Lichao Sun ·

    LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

    arXiv:2510.03827v2 Announce Type: replace Abstract: LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and prev…

  449. arXiv cs.CV TIER_1 English(EN) · Kaushik Roy ·

    MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

    Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uni…

  450. arXiv cs.CV TIER_1 English(EN) · Mingkun Xu ·

    Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models

    Vision Language Models adapt well to downstream tasks but are highly vulnerable to adversarial perturbations that disrupt cross-modal semantic alignment. Existing defenses are largely unidirectional or structural, failing to exploit bidirectional cross-modal complementarity and i…

  451. arXiv cs.CV TIER_1 English(EN) · Zixuan Hu, Xuantuo Huang, Yancheng Li, Yichun Hu, Shengyong Xu, Ling-Yu Duan ·

    Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation

    arXiv:2605.23257v1 Announce Type: cross Abstract: Navigating under non-stationary environment shifts poses a critical challenge for a Vision-and-Language Navigation (VLN) agent deployed in the wild. Yet, existing Test-Time Adaptation (TTA) methods for VLN largely treat online ada…

  452. arXiv cs.CV TIER_1 English(EN) · Kuk-Jin Yoon ·

    Multimodal Distribution Matching for Vision-Language Dataset Distillation

    Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal align…

  453. arXiv cs.CV TIER_1 English(EN) · Mu Yang ·

    ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

    Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction-aware temporal dependencies via causal factorization, but their step-wise deco…

  454. arXiv cs.CV TIER_1 English(EN) · Ling-Yu Duan ·

    Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation

    Navigating under non-stationary environment shifts poses a critical challenge for a Vision-and-Language Navigation (VLN) agent deployed in the wild. Yet, existing Test-Time Adaptation (TTA) methods for VLN largely treat online adaptation as transient, isolated updates, leading to…

  455. arXiv cs.CV TIER_1 English(EN) · Bingjun Luo, Tony Wang, Hanqi Chen, Xinpeng Ding ·

    Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

    arXiv:2605.22078v1 Announce Type: cross Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly advanced video understanding tasks, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existi…

  456. arXiv cs.CV TIER_1 English(EN) · Tianyi Zhang, Mahtab Bigverdi, Ranjay Krishna ·

    Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

    arXiv:2605.21642v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support "visual thinking." Despite improved task accuracy, this alone does not show that models actually use these tok…

  457. arXiv cs.CV TIER_1 English(EN) · Zhi Liu ·

    CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

    arXiv:2605.21854v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct…

  458. arXiv cs.CV TIER_1 English(EN) · Ruiqi Liu, Xiaolei Lv, Gengsheng Li, Ximo Zhu, Zhiheng Wang, Zhengbo Zhang, Junkai Chen, Zhiheng Li, Bo Li, Jun Gao, Shu Wu ·

    Visual-Advantage On-Policy Distillation for Vision-Language Models

    arXiv:2605.21924v1 Announce Type: new Abstract: On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output qu…

  459. arXiv cs.CV TIER_1 English(EN) · Chengsheng Zhang, Chenghao Sun, Zhining Xie, Xinmei Tian ·

    Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow

    arXiv:2605.21980v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) represent a significant leap towards empathetic agents, demonstrating remarkable capabilities in emotion understanding. However, the internal mechanisms governing how LVLMs translate abstract vis…

  460. arXiv cs.CV TIER_1 English(EN) · Jiahao Yang, Zihan Wang, Xiangyang Li, Xing Zhu, Yujun Shen, Yinghao Xu, Shuqiang Jiang ·

    GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

    arXiv:2605.22036v1 Announce Type: new Abstract: Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational ove…

  461. arXiv cs.CV TIER_1 English(EN) · Xiaodong Mei, Diankun Zhang, Hongwei Xie, Guang Chen, Hangjun Ye, Dan Xu ·

    LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

    arXiv:2605.22089v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding …

  462. arXiv cs.CV TIER_1 English(EN) · Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng ·

    Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

    arXiv:2605.22446v1 Announce Type: new Abstract: While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low…

  463. arXiv cs.CV TIER_1 English(EN) · David Méndez, Roberto Confalonieri, Natalia Díaz Rodríguez ·

    Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling

    arXiv:2605.22484v1 Announce Type: new Abstract: Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current…

  464. arXiv cs.CV TIER_1 English(EN) · Hyejin Go, Semi Lee, Hyesong Choi ·

    What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

    arXiv:2605.22651v1 Announce Type: new Abstract: CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stric…

  465. arXiv cs.CV TIER_1 English(EN) · Bing Hu, Zaijing Li, Rui Shao, Junda Chen, April Hua Liu, Wei-Shi Zheng, Liqiang Nie ·

    From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

    arXiv:2605.22671v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt t…

  466. arXiv cs.CV TIER_1 English(EN) · Wenxuan Guo, Ziyuan Li, Meng Zhang, Yichen Liu, Yimeng Dong, Chuxi Xu, Yunfei Wei, Ze Chen, Erjin Zhou, Jianjiang Feng ·

    GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

    arXiv:2605.22812v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve…

  467. arXiv cs.CV TIER_1 English(EN) · Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu ·

    AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

    arXiv:2605.22816v1 Announce Type: cross Abstract: Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (…

  468. arXiv cs.CV TIER_1 English(EN) · Hansang Lee, Haeil Lee, Junmo Kim ·

    Do Vision Models Encode Object-Level Semantic Relatedness? A Cognitive Psychology-Inspired Benchmark

    arXiv:1709.03806v2 Announce Type: replace Abstract: Modern vision models have achieved strong object-recognition performance, yet it remains unclear whether their representations encode object-level semantic relatedness, the meaningful connection between object concepts that supp…

  469. arXiv cs.CV TIER_1 English(EN) · Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han ·

    Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

    arXiv:2505.16416v3 Announce Type: replace Abstract: Rotary Position Embedding (RoPE) is widely adopted in large language models, but when applied to vision-language models (VLMs) it couples text and image position indices and can introduce spurious cross-modal relative-position b…

  470. arXiv cs.CV TIER_1 English(EN) · Seulbi Lee, Sangheum Hwang ·

    Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

    arXiv:2602.17186v2 Announce Type: replace Abstract: Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through dec…

  471. arXiv cs.CV TIER_1 English(EN) · Kangyi Wu, Pengna Li, Kailin Lyu, Xi Lin, Lin Zhao, Qingrong He, Jinjun Wang, Jianyi Liu ·

    Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    arXiv:2604.17473v2 Announce Type: replace Abstract: Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have largely advanced VLN, they remain highly…

  472. arXiv cs.CV TIER_1 English(EN) · Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang, Chuang Zhu ·

    RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

    arXiv:2605.19329v2 Announce Type: replace Abstract: Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event came…

  473. arXiv cs.CV TIER_1 English(EN) · Jiwen Lu ·

    AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

    Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often…

  474. arXiv cs.CV TIER_1 English(EN) · Jianjiang Feng ·

    GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

    Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple…

  475. arXiv cs.CV TIER_1 English(EN) · Liqiang Nie ·

    From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

    Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through act…

  476. arXiv cs.CV TIER_1 English(EN) · Hyesong Choi ·

    What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

    CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compos…

  477. arXiv cs.CV TIER_1 English(EN) · Natalia Díaz Rodríguez ·

    Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling

    Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational…

  478. arXiv cs.CV TIER_1 English(EN) · Cordelia Schmid ·

    PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

    Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fi…

  479. arXiv cs.CV TIER_1 English(EN) · Xiangxiang Chu ·

    RISE: Reliable Improvement in Self-Evolving Vision-Language Models

    Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimoda…

  480. arXiv cs.CV TIER_1 English(EN) · Jenq-Neng Hwang ·

    CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

    Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial…

  481. arXiv cs.CV TIER_1 English(EN) · Xiu-Shen Wei ·

    Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models

    Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have proven that adversarial attacks can significantly degrade the inference ability of VLMs, posi…

  482. arXiv cs.CV TIER_1 English(EN) · Gemma Roig ·

    Mechanisms of Object Localization in Vision-Language Models

    Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object…

  483. arXiv cs.CV TIER_1 English(EN) · Daquan Zhou ·

    StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

    It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual condit…

  484. arXiv cs.CV TIER_1 English(EN) · Yishun Lu ·

    Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

    Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses…

  485. arXiv cs.CV TIER_1 English(EN) · Shunli Zhang ·

    Neutral-Reference Prompting for Vision-Language Models

    Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing …

  486. arXiv cs.CV TIER_1 English(EN) · Pheng-Ann Heng ·

    ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

    Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-tri…

  487. arXiv cs.CV TIER_1 English(EN) · Zhiqiang Shen ·

    On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

    Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, t…

  488. arXiv cs.CV TIER_1 English(EN) · Bo Zhao ·

    Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

    Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely o…

  489. arXiv cs.CV TIER_1 English(EN) · Xingyu Chen ·

    SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

    General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing …

  490. arXiv cs.CV TIER_1 English(EN) · Simon See ·

    MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

    Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two o…

  491. arXiv cs.CV TIER_1 English(EN) · Przemysław Biecek ·

    Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers

    Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decompositio…

  492. arXiv cs.CV TIER_1 English(EN) · Yangqiu Song ·

    Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

    Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insuff…

  493. arXiv cs.CV TIER_1 English(EN) · Lei Zhang ·

    Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

    In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" m…

  494. arXiv cs.CV TIER_1 English(EN) · Yadong Mu ·

    RotVLA: Rotational Latent Action for Vision-Language-Action Model

    Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and d…

  495. arXiv cs.CV TIER_1 English(EN) · Ting Cao ·

    GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

    In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, vi…

  496. arXiv cs.CV TIER_1 English(EN) · Stefano Peluchetti ·

    KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

    Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-t…

  497. arXiv cs.CV TIER_1 English(EN) · Yukyung Choi ·

    CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

    In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. Howeve…

  498. arXiv cs.CV TIER_1 English(EN) · Jingyuan Chen ·

    A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning

    Efficient transfer learning methods for large-scale vision-language models (e.g., CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has no…

  499. arXiv cs.CV TIER_1 English(EN) · Siheng Chen ·

    Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this bu…

  500. arXiv cs.CV TIER_1 English(EN) · Sangdoo Yun ·

    Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation st…

  501. arXiv cs.CV TIER_1 English(EN) · Miguel P. Eckstein ·

    Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

    Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate c…

  502. arXiv cs.CV TIER_1 English(EN) · Zitong Yu ·

    Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

    Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large La…

  503. arXiv cs.CV TIER_1 English(EN) · Alan Yuille ·

    LychSim: A Controllable and Interactive Simulation Framework for Vision Research

    While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical ba…

  504. arXiv cs.CV TIER_1 English(EN) · Guanjun Jiang ·

    Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We id…

  505. arXiv cs.CV TIER_1 English(EN) · Feng Dai ·

    VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

    Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses t…

  506. arXiv cs.CV TIER_1 English(EN) · Yulun Zhang ·

    G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

    The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While…

  507. arXiv cs.CV TIER_1 English(EN) · Xun Wang ·

    Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images…

  508. arXiv cs.CV TIER_1 English(EN) · Zheng-Jun Zha ·

    Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent v…

  509. arXiv cs.CV TIER_1 English(EN) · Philipp Johannes Schubert ·

    BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding

    Scene understanding is central to general physical intelligence, and video is a primary modality for capturing both state and temporal dynamics of a scene. Yet understanding physical processes remains difficult, as models must combine object localization, hand-object interactions…

  510. arXiv cs.CV TIER_1 English(EN) · Fei Tian ·

    Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audi…

  511. arXiv cs.CV TIER_1 English(EN) · Paolo Soda ·

    Resilient Vision-Tabular Multimodal Learning under Modality Missingness

    Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that r…

  512. arXiv cs.CV TIER_1 English(EN) · Chenggang Yan ·

    Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models

    Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose clust…

  513. arXiv cs.CV TIER_1 English(EN) · Dacheng Tao ·

    Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

    Post-training with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enablin…

  514. arXiv cs.CV TIER_1 English(EN) · Haoang Li ·

    CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

    This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can …

  515. arXiv cs.CV TIER_1 English(EN) · Gang Pan ·

    ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models

    Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstr…

  516. arXiv cs.CV TIER_1 English(EN) · Shen Li ·

    C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving

    Safety-critical planning in complex environments, particularly at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods, whether rule-based or data-driven, frequently struggle to capture complex scene semantics, infer potential risks, and m…

  517. arXiv cs.CV TIER_1 English(EN) · Qingyao Wu ·

    Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

    During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention to compel focusing on key image information, we argue these tokens are critical carriers of visual an…

  518. arXiv cs.CV TIER_1 English(EN) · Vasileios Mezaris ·

    LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

    Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for m…

  519. arXiv cs.CV TIER_1 English(EN) · Wei He ·

    SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

    Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS…

  520. arXiv cs.CV TIER_1 English(EN) · Wenzhao Zheng ·

    Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

    Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision m…

  521. arXiv cs.CV TIER_1 English(EN) · Jinsong Su ·

    Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

    Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlyi…

  522. arXiv cs.CV TIER_1 English(EN) · Plachetka Christopher ·

    Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

    Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of auto…

  523. arXiv cs.CV TIER_1 English(EN) · Zhanyu Ma ·

    PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

    Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve …

  524. arXiv cs.CV TIER_1 English(EN) · Cheng Deng ·

    Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models

    Class-incremental learning aims to continuously acquire new knowledge while preserving previously learned information, thereby mitigating catastrophic forgetting. Existing methods primarily restrict parameter updates but often overlook their structural properties in high-dimensio…

  525. arXiv cs.CV TIER_1 English(EN) · Cheng Deng ·

    DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models

    Continual learning enables vision-language models to accumulate knowledge and adapt to evolving tasks without retraining from scratch. However, in multi-domain task-incremental learning, large domain shifts intensify the stability-plasticity dilemma. Most existing methods rely on…

  526. arXiv cs.CV TIER_1 English(EN) · Zheng Li, Jerry Cheng, Huanying Helen Gu ·

    StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods

    arXiv:2604.04552v3. Ensemble methods improve predictive performance but often incur high memory and computational costs. We identify an aggregation instability induced by nonlinear projection and voting operations. To address both efficiency challe…

  527. arXiv cs.CV TIER_1 English(EN) · Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin, Xiaoshuai Hao, Kun Wang, Wendong Wang ·

    Large Vision-Language Models Get Lost in Attention

    arXiv:2605.05668v1. Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct…

  528. arXiv cs.CV TIER_1 English(EN) · Jintao Sun, Gangyi Ding, Donglin Di, Hu Zhang, Zhedong Zheng ·

    Can Vision-Language Models Think from the Sky? Unifying UAV Reasoning and Generation

    arXiv:2604.05377v2. Vision-Language Models have achieved strong progress in ground-view visual understanding, yet they remain brittle in high-altitude Unmanned Aerial Vehicle scenes, where objects are tiny and densely packed, textures are repetitiv…

  529. arXiv cs.CV TIER_1 English(EN) · Zhenyu Wu ·

    DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

    Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensiti…

  530. arXiv cs.CV TIER_1 English(EN) · Jiajin Guan (Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu, China), Haibo Mei (School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, C ·

    UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

    arXiv:2508.11196v2. Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features hig…

  531. arXiv cs.CV TIER_1 English(EN) · Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Lei Huang, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin ·

    CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    arXiv:2605.04641v1. Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce contents that deviate from visual information, leading to object hallucination. To tackle this, rece…

  532. arXiv cs.CV TIER_1 English(EN) · Yihan Lin, Haoyang Li, Yang Li, Haitao Shen, Yihan Zhao, Chao Shao, Jing Zhang ·

    From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    arXiv:2605.04678v1. Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragment…

  533. arXiv cs.CV TIER_1 English(EN) · Jing Zhang ·

    From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work str…

  534. arXiv cs.CV TIER_1 English(EN) · Bing Qin ·

    CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce contents that deviate from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annot…

  535. arXiv cs.CV TIER_1 English(EN) · JF Bastien, Sam D'Amico ·

    VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models

    arXiv:2605.03351v1. Video vision-language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study th…

  536. arXiv cs.CV TIER_1 English(EN) · Sen Nie, Jie Zhang, Zhongqi Wang, Zhaoyang Wei, Shiguang Shan, Xilin Chen ·

    What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

    arXiv:2603.12799v2. Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fu…

  537. arXiv cs.CV TIER_1 English(EN) · Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein ·

    IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models

    arXiv:2602.16138v2. We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 uniqu…

  538. arXiv cs.CV TIER_1 English(EN) · Christian Rominger (University of Graz), Andreas R. Schwerdtfeger (University of Graz), Malay Gaherwar Singh (TU Dresden), Dimitri Khudyakow (TU Dresden), Elizabeth A. M. Michels (TU Dresden), Fabian Wolf (TU Dresden), Jakob Nikolas Kather (TU Dresden, Un ·

    Quantifying the human visual exposome with vision language models

    arXiv:2605.03863v1. The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self reports, …

  539. arXiv cs.CV TIER_1 English(EN) · Xiaowen Sun, Matthias Kerzel, Mengdi Li, Xufeng Zhao, Paul Striker, Stefan Wermter ·

    StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

    arXiv:2605.03927v1. Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject t…

  540. arXiv cs.CV TIER_1 English(EN) · Kangkang Wang, Qinting Jiang, Wanping Zhang, Bowen Ren, Shengzhao Wen ·

    MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

    arXiv:2605.03485v1. Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric e…

  541. arXiv cs.CV TIER_1 English(EN) · Yujun Li, Hongyuan Zhang, Yuan Yuan ·

    GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

    arXiv:2605.03403v1. Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. It raises a question of whether the GRPO also significantly promotes the test-time ad…

  542. arXiv cs.CV TIER_1 English(EN) · Stefan Wermter ·

    StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

    Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large lan…

  543. arXiv cs.CV TIER_1 English(EN) · Yuan Yuan ·

    GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

    Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. It raises a question of whether the GRPO also significantly promotes the test-time adaptation (TTA) of vision language models. In thi…

  544. arXiv cs.CV TIER_1 English(EN) · Sam D'Amico ·

    VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models

    Video vision-language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study that waste as training-free anti-recomputation: re…

  545. arXiv cs.CV TIER_1 English(EN) · Zeshang Li, Shuoyang Zhang, Jiashen Ding ·

    GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models

    arXiv:2605.01733v1. Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rath…

  546. arXiv cs.CV TIER_1 English(EN) · Yagiz Nalcakan, Hyeongjin Ju, Incheol Park, Sanghyeop Yeo, Youngwan Jin, Shiho Kim ·

    SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

    arXiv:2605.02258v1. Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and L…

  547. arXiv cs.CV TIER_1 English(EN) · Zhou Bingtao, Xiang Mian, Ning Qian ·

    Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

    arXiv:2605.02604v1. Source-Free Domain Adaptation (SFDA) adapts source models to target domains without accessing source data, addressing privacy and transmission issues. However, existing methods still initialize from a source pre-trained model and th…

  548. arXiv cs.CV TIER_1 English(EN) · Chenyu Hui, Xiaodi Huang, Siyu Xu, Yunke Wang, Shan You, Fei Wang, Tao Huang, Chang Xu ·

    Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    arXiv:2605.02757v1. Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limi…

  549. arXiv cs.CV TIER_1 English(EN) · Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao ·

    jina-vlm: Small Multilingual Vision Language Model

    arXiv:2512.04032v3. We present jina-vlm, a token-efficient 2.4B parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 lang…

  550. arXiv cs.CV TIER_1 English(EN) · Anjie Liu, Ziqin Gong, Yan Song, Yuxiang Chen, Xiaolong Liu, Hengtong Lu, Kaike Zhang, Chen Wei ·

    Active Reasoning Vision-Language Models via Sequential Experimental Design

    arXiv:2605.01345v1. Visual perception in modern Vision-Language Models (VLMs) is constrained by a fundamental perceptual bandwidth bottleneck: a broad field of view inevitably sacrifices the fine-grained details necessary for complex reasoning. Inspire…

  551. arXiv cs.CV TIER_1 English(EN) · Yin Zhang, Jiaxuan Zhao, Zonghan Wu, Zengxiang Li, Junfeng Fang, Kun Wang, Qingsong Wen, Yilei Shao ·

    MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

    arXiv:2605.01520v1. Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising so…

  552. arXiv cs.CV TIER_1 English(EN) · Ümit Mert Çağlar, Alptekin Temizel ·

    Grounding Synthetic Data Generation With Vision and Language Models

    arXiv:2603.09625v2 Announce Type: replace Abstract: Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feat…

  553. arXiv cs.CV TIER_1 English(EN) · Chang Xu ·

    Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak r…

  554. arXiv cs.CV TIER_1 English(EN) · Ning Qian ·

    Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

    Source-Free Domain Adaptation (SFDA) adapts source models to target domains without accessing source data, addressing privacy and transmission issues. However, existing methods still initialize from a source pre-trained model and thus are not truly source-free. Recent works have …

  555. arXiv cs.CV TIER_1 English(EN) · Shiho Kim ·

    SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

    Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and Long-Wave Infrared (LWIR) remains largely unexplo…

  556. arXiv cs.CV TIER_1 English(EN) · Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei ·

    Let ViT Speak: Generative Language-Image Pre-training

    arXiv:2605.00809v1 Announce Type: new Abstract: In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language model…

  557. arXiv cs.CV TIER_1 English(EN) · Jiayu Li, Jiaxin Qi, Sheng Zhou, Jiaqiang Huang, Xiansheng Hua ·

    Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

    arXiv:2605.00591v1 Announce Type: new Abstract: Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can ove…

  558. arXiv cs.CV TIER_1 English(EN) · Aharon Azulay, Jan Dubiński, Zhuoyun Li, Atharv Mittal, Yossi Gandelsman ·

    Jailbreaking Vision-Language Models Through the Visual Modality

    arXiv:2605.00583v1 Announce Type: new Abstract: The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual…

  559. arXiv cs.CV TIER_1 English(EN) · Phuong Ngoc Nguyen, Kaito Shiku, Ryoma Bise, Seiichi Uchida, Shinnosuke Matsuo ·

    Leveraging Vision-Language Models as Weak Annotators in Active Learning

    arXiv:2605.00480v1 Announce Type: new Abstract: Active learning aims to reduce annotation cost by selectively querying informative samples for supervision under a limited labeling budget. In this work, we investigate how vision-language models (VLMs) can be leveraged to further r…

  560. arXiv cs.CV TIER_1 English(EN) · Minghui Chen, Chenxu Yang, Hengjie Zhu, Dayan Wu, Zheng Lin, Qingyi Si ·

    Online Self-Calibration Against Hallucination in Vision-Language Models

    arXiv:2605.00323v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from…

  561. arXiv cs.CV TIER_1 English(EN) · Yunchao Wei ·

    Let ViT Speak: Generative Language-Image Pre-training

    In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with …

  562. arXiv cs.CV TIER_1 English(EN) · Xiansheng Hua ·

    Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

    Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because…

  563. arXiv cs.CV TIER_1 English(EN) · Yossi Gandelsman ·

    Jailbreaking Vision-Language Models Through the Visual Modality

    The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) re…

  564. arXiv cs.CV TIER_1 English(EN) · Shinnosuke Matsuo ·

    Leveraging Vision-Language Models as Weak Annotators in Active Learning

    Active learning aims to reduce annotation cost by selectively querying informative samples for supervision under a limited labeling budget. In this work, we investigate how vision-language models (VLMs) can be leveraged to further reduce the reliance on costly human annotation wi…

  565. arXiv cs.CV TIER_1 English(EN) · Mingliang Liang, Zhuoran Liu, Arjen P. de Vries, Martha Larson ·

    Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

    arXiv:2604.27932v1 Announce Type: new Abstract: The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distrib…

  566. arXiv cs.CV TIER_1 English(EN) · Hyeonseo Jang, Jaebyeong Jeon, Joong-Won Hwang, Kibok Lee ·

    Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

    arXiv:2604.27715v1 Announce Type: new Abstract: Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often…

  567. arXiv cs.CV TIER_1 English(EN) · Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Hanbing Li, Lin Zhao, Kailin Lyu, Long Chen, Zhi-Xin Yang, Nanning Zheng ·

    SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    arXiv:2604.27620v1 Announce Type: new Abstract: Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them wit…

  568. arXiv cs.CV TIER_1 English(EN) · Hong-Tao Yu, Yuxin Peng, Serge Belongie, Xiu-Shen Wei ·

    Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

    arXiv:2504.14988v4 Announce Type: replace Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both h…

  569. arXiv cs.CV TIER_1 English(EN) · Mengling Deng, Yuanpeng Chen, Sheng Yang, Wei Tao, Wenhai Zhang, Hui Song, Linyuanhao Qin, Kai Zhao, Xiaojun Ye, Shanhui Mo, Jingli Fan, Shuang Zhang, Bei Liu, Tiankun Zhao, Xiangjing An ·

    EdgeFM: Efficient Edge Inference for Vision-Language Models

    arXiv:2604.27476v1 Announce Type: new Abstract: Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resourc…

  570. arXiv cs.CV TIER_1 English(EN) · Qingyi Si ·

    Online Self-Calibration Against Hallucination in Vision-Language Models

    Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offl…

  571. arXiv cs.CV TIER_1 English(EN) · Kenneth J. K. Ong ·

    The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

    As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effects of visual priming on VLMs' cooperative behavior using the Iterated Prisoner'…

  572. arXiv cs.CV TIER_1 English(EN) · Martha Larson ·

    Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

    The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accur…

  573. arXiv cs.CV TIER_1 English(EN) · Kibok Lee ·

    Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

    Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often produces poorly calibrated models, raising conc…

  574. arXiv cs.CV TIER_1 English(EN) · Nanning Zheng ·

    SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them with two complementary capabilities for acquiring s…

  575. arXiv cs.CV TIER_1 English(EN) · Xiangjing An ·

    EdgeFM: Efficient Edge Inference for Vision-Language Models

    Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely o…

  576. arXiv cs.CV TIER_1 English(EN) · Junru Song, Yimeng Hu, Yijing Chen, Huining Li, Qian Li, Lizhen Cui, Yuntao Du ·

    Delineating Knowledge Boundaries for Honest Large Vision-Language Models

    arXiv:2604.26419v1 Announce Type: new Abstract: Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to re…

  577. arXiv cs.CV TIER_1 English(EN) · Junwon You, Mihyun Jang, Sangwoo Mo, Jae-Hun Jung ·

    Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning

    arXiv:2604.26370v1 Announce Type: new Abstract: Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text …

  578. arXiv cs.CV TIER_1 English(EN) · Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang ·

    FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

    arXiv:2504.09925v3 Announce Type: replace Abstract: We introduce FLARE, a family of vision language models (VLMs) with a fully vision-language alignment and integration paradigm. Unlike existing approaches that rely on single MLP projectors for modality alignment and defer cross-…

  579. arXiv cs.CV TIER_1 English(EN) · Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng, Chrysa Papagianni ·

    Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

    arXiv:2604.26508v1 Announce Type: cross Abstract: Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully off…

  580. arXiv cs.CV TIER_1 English(EN) · Chrysa Papagianni ·

    Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

    Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractica…

  581. arXiv cs.CV TIER_1 English(EN) · Yuntao Du ·

    Delineating Knowledge Boundaries for Honest Large Vision-Language Models

    Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to refuse queries that exceed their parametric knowle…

  582. arXiv cs.CV TIER_1 English(EN) · Jae-Hun Jung ·

    Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning

    Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text pairs together with abundant unlabeled images, e…

  583. arXiv cs.CV TIER_1 English(EN) · Soroush Seifi, Vaggelis Dorovatas, Matteo Cassinelli, Fabien Despinoy, Daniel Olmeda Reino, Rahaf Aljundi ·

    Personalization Toolkit: Training Free Personalization of Large Vision Language Models

    arXiv:2502.02452v4 Announce Type: replace Abstract: Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users or object instances and to generate contextually tailored responses. Existing approaches rely on time-consuming trai…

  584. arXiv cs.CV TIER_1 English(EN) · Yashwant Pravinrao Bangde, Debaditya Roy ·

    Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

    arXiv:2604.25809v1 Announce Type: new Abstract: Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have sh…

  585. arXiv cs.CV TIER_1 English(EN) · Debaditya Roy ·

    Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

    Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have shown that instruction prompting further worsens t…

  586. arXiv cs.CV TIER_1 English(EN) · Kanishk Jain, Qian Yang, Shravan Nayak, Parisa Kordjamshidi, Nishanth Anand, Aishwarya Agrawal ·

    Discovering Failure Modes in Vision-Language Models using RL

    arXiv:2604.04733v2 Announce Type: replace Abstract: Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoi…

  587. arXiv cs.CV TIER_1 English(EN) · Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva ·

    LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

    arXiv:2604.00829v3 Announce Type: replace Abstract: Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such l…

  588. arXiv cs.CV TIER_1 English(EN) · Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park ·

    Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

    arXiv:2603.19482v2 Announce Type: replace Abstract: Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curate…

  589. arXiv cs.CV TIER_1 English(EN) · Soumyaratna Debnath, Bui Duc Manh, Zinan Liu, Lin Wang ·

    LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

    arXiv:2603.14882v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither unifo…

  590. arXiv cs.CV TIER_1 English(EN) · Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim ·

    Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

    arXiv:2512.10362v2 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop s…

  591. arXiv cs.CV TIER_1 English(EN) · Zikun Guo, Jingwei Lv, Xinyue Xu, Shu Yang, Jun Wen, Di Wang, Lijie Hu ·

    Benchmarking and Mitigating Sycophancy in Medical Vision Language Models

    arXiv:2509.21979v4 Announce Type: replace Abstract: Visual language models (VLMs) have the potential to transform medical workflows. However, the deployment is limited by sycophancy. Despite this serious threat to patient safety, a systematic benchmark remains lacking. This paper…

  592. arXiv cs.CV TIER_1 English(EN) · Hanqi Yan, Xiangxiang Cui, Lu Yin, Jindong Gu, Paul Pu Liang, Yulan He, Yifei Wang ·

    Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models

    arXiv:2502.14888v4 Announce Type: replace Abstract: The success of vision-language models is primarily attributed to effective alignment across modalities such as vision and language. However, modality gaps persist in existing alignment algorithms and appear necessary for human p…

  593. arXiv cs.CV TIER_1 English(EN) · Zongxia Li, Wenhao Yu, Chengsong Huang, Zhenwen Liang, Rui Liu, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu ·

    Self-Rewarding Vision-Language Model via Reasoning Decomposition

    arXiv:2508.19652v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) often suffer from visual hallucinations: generating things that are not consistent with visual inputs and language shortcuts, where they skip the visual part and just rely on text priors. These issu…

  594. arXiv cs.CV TIER_1 English(EN) · Tairan Fu, Francisco Javier Santos-Martín, Javier Conde, Pedro Reviriego, Elena Merino-Gómez ·

    Lost in the Vibrations: Vision Language Models Fail the Dynamic Gauges Test

    arXiv:2604.22829v1 Announce Type: new Abstract: The digital transformation of industrial manufacturing increasingly relies on the ability of autonomous robots to interact with legacy infrastructure, particularly analog gauges. While Vision-Language Models (VLMs) have demonstrated…

  595. arXiv cs.CV TIER_1 English(EN) · Brandon Collins, Logan Bolton, Hung Huy Nguyen, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen ·

    SketchVLM: Vision language models can annotate images to explain thoughts and guide users

    arXiv:2604.22875v1 Announce Type: new Abstract: When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficu…

  596. arXiv cs.CV TIER_1 English(EN) · Ashwin Kumar, Robbie Holland, Corey Barrett, Jangwon Kim, Maya Varma, Zhihong Chen, Yunhe Gao, Greg Zaharchuk, Tara Taghavi, Krishnaram Kenthapadi, Akshay Chaudhari ·

    CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

    arXiv:2604.22989v1 Announce Type: new Abstract: Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer…

  597. arXiv cs.CV TIER_1 English(EN) · Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang, Kaiwen Long ·

    LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    arXiv:2604.23950v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address …

  598. arXiv cs.CV TIER_1 English(EN) · Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng, Yifan Du, Wayne Xin Zhao, Min Yang, Ji-Rong Wen ·

    Improving Vision-language Models with Perception-centric Process Reward Models

    arXiv:2604.24583v1 Announce Type: new Abstract: Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnos…

  599. arXiv cs.CV TIER_1 English(EN) · Lixian Chen, Mingxuan Huang, Yanhui Chen, Junyi Lin, Yang Shi ·

    Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

    arXiv:2604.24602v1 Announce Type: new Abstract: Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while…

  600. arXiv cs.CV TIER_1 English(EN) · Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang, Weinong Wang, Yu Guo, Bin Qian, Zhihai He ·

    CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    arXiv:2604.24622v1 Announce Type: new Abstract: Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian n…

  601. arXiv cs.CV TIER_1 (CA) · Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishn ·

    NVILA: Efficient Frontier Visual Language Models

    arXiv:2412.04468v3 Announce Type: replace Abstract: Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimiz…

  602. arXiv cs.CV TIER_1 English(EN) · Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott ·

    Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

    arXiv:2604.14888v2 Announce Type: replace-cross Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instructio…

  603. arXiv cs.CV TIER_1 English(EN) · Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He ·

    Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

    arXiv:2604.21728v2 Announce Type: replace Abstract: Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labe…

  604. arXiv cs.CV TIER_1 English(EN) · Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Liwei Zhang, Weihao Yuan, Siyu Zhu ·

    BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

    arXiv:2604.16514v4 Announce Type: replace Abstract: Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directl…

  605. arXiv cs.CV TIER_1 English(EN) · Shaotian Li, Shangze Li, Chuancheng Shi, Wenhua Wu, Yanqiu Wu, Xiaohan Yu, Fei Shen, Tat-Seng Chua ·

    Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models

    arXiv:2604.07802v2 Announce Type: replace Abstract: Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current methods predominantly treat VLMs…

  606. arXiv cs.CV TIER_1 English(EN) · Zhihai He ·

    CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade…

  607. arXiv cs.CV TIER_1 English(EN) · Yang Shi ·

    Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

    Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modalit…

  608. arXiv cs.CV TIER_1 English(EN) · Ji-Rong Wen ·

    Improving Vision-language Models with Perception-centric Process Reward Models

    Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain.…

  609. arXiv cs.CV TIER_1 English(EN) · Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang ·

    V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

    arXiv:2504.06148v3 Announce Type: replace Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual-text processing. However, existing static image-text benchmarks are insufficient for evaluating their dynamic pe…

  610. arXiv cs.CV TIER_1 English(EN) · Kaiwen Long ·

    LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens,…

  611. arXiv cs.CV TIER_1 English(EN) · Jingrui He ·

    Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

    Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. …

  612. arXiv cs.CV TIER_1 English(EN) · Mitesh M. Khapra ·

    Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

    Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains…

  613. arXiv cs.CV TIER_1 English(EN) · Rongrong Ji ·

    Prototype-Based Test-Time Adaptation of Vision-Language Models

    Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce …

  614. Anyscale blog TIER_1 English(EN) ·

    Introducing Vision-Language Reinforcement Learning in SkyRL

    SkyRL now supports vision-language model post-training. Run scalable RL and SFT for multimodal models on Ray, ready to run your existing Tinker recipes.

  615. MarkTechPost TIER_1 English(EN) · Asif Razzaq ·

    Zyphra Release Zamba2-VL: Hybrid Mamba2–Transformer Vision-Language Models That Cut Time-to-First-Token by About an Order of Magnitude

    Zyphra has released Zamba2-VL, a family of open vision-language models at 1.2B, 2.7B, and 7B parameters. The models use a hybrid Mamba2 state-space and Transformer backbone, shipping under Apache 2.0. They stay competitive with comparable Transformer VLMs while cutting time-to…

  616. MarkTechPost TIER_1 English(EN) · Sana Hassan ·

    Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export

    In this tutorial, we explore the TuringEnterprises/Open-MM-RL dataset as a practical foundation for multimodal reasoning and reinforcement learning with verifiable rewards. We load the dataset, inspect its schema, analyze domains, formats, question lengths, answer types, and i…

  617. dev.to — Anthropic tag TIER_1 English(EN) · Jangwook Kim ·

    Claude Opus 4.7: High-Res Vision, Task Budgets, and Agentic Coding

    Anthropic released Claude Opus 4.7 on April 16, 2026. Three things make this release worth paying attention to if you were on Opus 4.6 and wondering whether it was time to upgrade: a significant jump in image resolution support, a new task budget mechanism for agentic loops, a…

  618. HN — machine learning stories TIER_1 English(EN) · 2bit ·

    FastVLM: Efficient Vision Encoding for Vision Language Models

  619. Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] ·

    LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

    LocateAnything Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding ABSTRACT: Overcoming Autoregressive Bottlenecks in VLM Grounding Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, ser…

  620. r/LocalLLaMA TIER_1 English(EN) · /u/Sporeboss ·

    Nvidia LocateAnything - Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding. (10x faster than Qwen3-VL)

    https://huggingface.co/nvidia/LocateAnything-3B https://github.com/NVlabs/Eagle demo https://huggingface.co/…

  621. dev.to — LLM tag TIER_1 English(EN) · 丁久 ·

    Multimodal AI Models: Vision, Audio, and Text

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/multimodal-models.html). For the full version with working code examples and related articles, visit the original post.…

  622. dev.to — LLM tag TIER_1 English(EN) · 丁久 ·

    Building Multimodal AI Applications: Vision, Audio, and Text Combined (2026)

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/multimodal-ai-guide.html). For the full version with working code examples and related articles, visit the original post.…

  623. r/MachineLearning TIER_1 English(EN) · /u/Nice-Dragonfly-4823 ·

    How Visual-Language-Action (VLA) Models Work [D]

    https://www.reddit.com/r/MachineLearning/comments/1svhwtz/how_visuallanguageaction_vla_models_work_d/

  624. r/singularity TIER_2 English(EN) · /u/Worldly_Evidence9113 ·

    A foundation model of vision, audition, and language for in-silico neuroscience

    Research Paper (arXiv): [2605.04326] A foundation model of vision, audition, and language for in-silico neuroscience https://arxiv.org/abs/2605.04326 Model Codeb…