
Hugging Face showcases new vision language models from Google, Microsoft, and others

Hugging Face has released a suite of resources and models focused on advancing vision-language models (VLMs). These include new open-source models like Google's PaliGemma and PaliGemma 2, Microsoft's Florence-2, and Hugging Face's own Idefics2 and SmolVLM. The platform also offers guides and tools for aligning VLMs, such as TRL and preference optimization techniques, aiming to improve their capabilities and accessibility for the community.

Summary written by gemini-2.5-flash-lite from 189 sources.

IMPACT Expands the ecosystem of open-source vision-language models and provides tools for their alignment and fine-tuning.

RANK_REASON Multiple blog posts detailing new open-source vision-language models and alignment techniques released by various organizations on Hugging Face.

Read on Hugging Face Blog →
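
The summary above points at concrete, runnable tooling. As a rough illustration only, the following minimal sketch loads one of the listed models (SmolVLM is used here; the Hub ID, local image path, and generation settings are assumptions, and PaliGemma, Florence-2, or Idefics2 would each need their own prompt format) and runs a single image-to-text query with the transformers library.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Assumed checkpoint ID; other open VLMs covered below can be swapped in,
# but prompt formatting differs between model families.
model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

image = Image.open("example.jpg")  # placeholder path to any local image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])

For the alignment side the summary mentions (TRL and preference optimization), a comparably hedged sketch is a DPOTrainer run over a preference dataset; the column names and the processing_class argument reflect recent TRL releases and may differ in other versions.

from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Toy preference data; a real run would use a curated dataset such as those
# discussed in the preference-optimization post below. Schema is an assumption.
pref_dataset = Dataset.from_dict({
    "images": [[Image.open("example.jpg")]],
    "prompt": ["Describe this image."],
    "chosen": ["A detailed, faithful description of the image."],
    "rejected": ["An unrelated or hallucinated description."],
})

trainer = DPOTrainer(
    model=model,                 # the VLM loaded in the sketch above
    args=DPOConfig(output_dir="vlm-dpo", per_device_train_batch_size=1),
    train_dataset=pref_dataset,
    processing_class=processor,  # processor stands in for the tokenizer in VLM DPO
)
trainer.train()

Both snippets are sketches under the stated assumptions, not reproductions of the linked posts' exact recipes.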

COVERAGE [189]

  1. Hugging Face Blog TIER_1 ·

    Vision Language Model Alignment in TRL ⚡️

  2. Hugging Face Blog TIER_1 Dansk(DA) ·

    Vision Language Models (Better, faster, stronger)

  3. Hugging Face Blog TIER_1 Dansk(DA) ·

    SigLIP 2: A Better Multilingual Vision Language Encoder

  4. Hugging Face Blog TIER_1 ·

    Welcome PaliGemma 2 – New vision language models by Google

  5. Hugging Face Blog TIER_1 ·

    SmolVLM - small yet mighty Vision Language Model

  6. Hugging Face Blog TIER_1 ·

    Preference Optimization for Vision Language Models

  7. Hugging Face Blog TIER_1 ·

    Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models

  8. Hugging Face Blog TIER_1 ·

    PaliGemma – Google's Cutting-Edge Open Vision Language Model

  9. Hugging Face Blog TIER_1 ·

    Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

  10. Hugging Face Blog TIER_1 ·

    Vision Language Models Explained

  11. Hugging Face Blog TIER_1 ·

    Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

  12. Hugging Face Blog TIER_1 ·

    Accelerating Vision-Language Models: BridgeTower on Habana Gaudi2

  13. Hugging Face Blog TIER_1 ·

    A Dive into Vision-Language Models

  14. arXiv cs.AI TIER_1 (CA) · Yong Yu ·

    MMSkills: Towards Multimodal Skills for General Visual Agents

    Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: …

  15. arXiv cs.CL TIER_1 · Chang D. Yoo ·

    PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

    Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoni…

  16. arXiv cs.CL TIER_1 · Hinrich Schütze ·

    DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

    Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construct…

  17. arXiv cs.AI TIER_1 · Taesik Gong ·

    Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

    Vision-Language-Action (VLA) policies are typically evaluated as if the user had finished typing or speaking before the robot begins acting. In real deployment, however, users take several seconds to enter a request, leaving the policy idle for a substantial fraction of the inter…

  18. arXiv cs.CL TIER_1 · Yong Li ·

    UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reas…

  19. arXiv cs.CL TIER_1 · Wenxin Yu ·

    Allegory of the Cave: Measurement-Grounded Vision-Language Learning

    Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formul…

  20. Hugging Face Daily Papers TIER_1 ·

    CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

    This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can …

  21. arXiv cs.AI TIER_1 · Marcin Chlebus ·

    Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality

    Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of de…

  22. arXiv cs.AI TIER_1 · Xingjun Ma ·

    ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free persp…

  23. arXiv cs.AI TIER_1 · Mattia Rigotti ·

    GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision…

  24. arXiv cs.AI TIER_1 · Zhenbo Xu ·

    RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

    Platform content moderation applies explicit policy rules and context-dependent conditions to decide whether user content is allowed, restricted, or removed. A correct moderation outcome must therefore depend on which rules a case activates, how those rules interact, and whether …

  25. arXiv cs.LG TIER_1 · Chenyu Huang, Peng Ye, Xudong Tan, Jinhan Mu, Shenghe Zheng, Li Shen, Tao Chen ·

    FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models

    arXiv:2601.21187v2 Announce Type: replace-cross Abstract: Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a co…

  26. arXiv cs.LG TIER_1 · Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, Biqing Qi ·

    AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

    arXiv:2511.14148v2 Announce Type: replace-cross Abstract: Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and …

  27. arXiv cs.LG TIER_1 · Shuyang Jiang, Nan Yu, Yiming Zhang, Zenghui Ding, Zhenyu Wu ·

    DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

    arXiv:2605.06592v1 Announce Type: cross Abstract: Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation…

  28. arXiv cs.LG TIER_1 · Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li ·

    VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

    arXiv:2605.05899v1 Announce Type: new Abstract: Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed fo…

  29. arXiv cs.AI TIER_1 · Zehao Deng, Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang ·

    Causal Probing for Internal Visual Representations in Multimodal Large Language Models

    arXiv:2605.05593v1 Announce Type: new Abstract: Despite the remarkable success of Multimodal Large Language Models (MLLMs) across diverse tasks, the internal mechanisms governing how they encode and ground distinct visual concepts remain poorly understood. To bridge this gap, we …

  30. arXiv cs.LG TIER_1 · Numan Saeed, Asif Hanif, Fadillah Adamsyah Maani, Hussain Alasmawi, Mohammad Yaqub ·

    DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression

    arXiv:2603.05421v3 Announce Type: replace-cross Abstract: Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude…

  31. arXiv cs.AI TIER_1 · Yuxuan Wu, Guangming Wang, Zhiheng Yang, Maoqing Yao, Brian Sheil, Hesheng Wang ·

    Continually Evolving Skill Knowledge in Vision Language Action Model

    arXiv:2511.18085v3 Announce Type: replace-cross Abstract: Vision-language-action (VLA) models show promising knowledge accumulation ability from pretraining, yet continual learning in VLA remains challenging, especially for efficient adaptation. Existing continual imitation learn…

  32. arXiv cs.LG TIER_1 · Binyu Zhao, Wei Zhang, Xingrui Yu, Zhaonian Zou, Ivor Tsang ·

    Advancing Analytic Class-Incremental Learning through Vision-Language Calibration

    arXiv:2602.13670v2 Announce Type: replace Abstract: Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. While analytic learning enables rapid, recursive closed-form updates, its efficacy …

  33. arXiv cs.LG TIER_1 · Stéphane d'Ascoli, Jérémy Rapin, Yohann Benchetrit, Teon Brooks, Katelyn Begany, Joséphine Raugel, Hubert Banville, Jean-Rémi King ·

    A foundation model of vision, audition, and language for in-silico neuroscience

    arXiv:2605.04326v1 Announce Type: cross Abstract: Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, hence preventing a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, aud…

  34. arXiv cs.AI TIER_1 · Shule Lu, Yujing Wang, Hainan Zhang, Xiaoshan Yang, Hongwei Zheng, Yongxin Tong, Changsheng Xu, Zhiming Zheng ·

    Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

    arXiv:2605.03426v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. Federated Learning mitigates this issue by e…

  35. arXiv cs.AI TIER_1 · Yuanyuan Jia, Shunpu Tang, Qianqian Yang ·

    CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

    arXiv:2605.02218v1 Announce Type: new Abstract: Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demand…

  36. Hugging Face Daily Papers TIER_1 ·

    A foundation model of vision, audition, and language for in-silico neuroscience

    Cognitive neuroscience is fragmented into specialized models, each tailored to specific experimental paradigms, hence preventing a unified model of cognition in the human brain. Here, we introduce TRIBE v2, a tri-modal (video, audio and language) foundation model capable of predi…

  37. arXiv cs.AI TIER_1 · Magdalena Katharina Wekenborg ·

    Quantifying the human visual exposome with vision language models

    The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self reports, failing to capture the first person visual context…

  38. arXiv cs.AI TIER_1 · Shengzhao Wen ·

    MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

    Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric evaluation. In this work, we introduce MHPR, a co…

  39. Hugging Face Daily Papers TIER_1 ·

    Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

    Source-Free Domain Adaptation (SFDA) adapts source models to target domains without accessing source data, addressing privacy and transmission issues. However, existing methods still initialize from a source pre-trained model and thus are not truly source-free. Recent works have …

  40. Hugging Face Daily Papers TIER_1 ·

    SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

    Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and Long-Wave Infrared (LWIR) remains largely unexplo…

  41. arXiv cs.AI TIER_1 · Kenneth J. K. Ong ·

    The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

    arXiv:2604.27953v1 Announce Type: new Abstract: As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effects of visual priming on VLMs' c…

  42. arXiv cs.AI TIER_1 · Ken Deng, Yifu Qiu, Yoni Kasten, Shay B. Cohen, Yftah Ziser ·

    Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

    arXiv:2601.22228v2 Announce Type: replace-cross Abstract: We study whether vision-language models (VLMs) can solve relative camera pose estimation (RCPE) from image pairs, a direct test of multi-view spatial reasoning. We cast RCPE as a discrete verbal classification task and int…

  43. arXiv cs.AI TIER_1 · Santosh Vasa, Aditi Ramadwar, Jnana Rama Krishna Darabattula, Md Zafar Anwar, Stanislaw Antol, Andrei Vatavu, Thomas Monninger, Sihao Ding ·

    AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models

    arXiv:2507.12414v2 Announce Type: replace-cross Abstract: Training of autonomous driving systems requires extensive datasets with precise annotations to attain robust performance. Human annotations suffer from imperfections, and multiple iterations are often needed to produce hig…

  44. arXiv cs.CL TIER_1 · Chahat Raj, Bowen Wei, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu ·

    VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

    arXiv:2505.22897v2 Announce Type: replace Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-…

  45. arXiv cs.CL TIER_1 · Alice Plebe, Timothy Douglas, Diana Riazi, R. Maria del Rio-Chanona ·

    Images Amplify Misinformation Sharing in Vision-Language Models

    arXiv:2505.13302v2 Announce Type: replace Abstract: As language and vision-language models (VLMs) become central to information access and online interaction, concerns grow about their potential to amplify misinformation. Human studies show that images boost the perceived credibi…

  46. arXiv cs.CL TIER_1 · Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu ·

    CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

    arXiv:2602.01785v2 Announce Type: replace Abstract: Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text…

  47. arXiv cs.AI TIER_1 · Kaijun Zhou, Qiwei Chen, Da Peng, Zhiyang Li, Xijun Li, Jinyu Gu ·

    Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    arXiv:2604.24447v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs,…

  48. arXiv cs.CL TIER_1 · Qidong Wang, Junjie Hu, Ming Jiang ·

    V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

    arXiv:2509.14837v2 Announce Type: replace Abstract: Recent advances in causal interpretability have extended from language models to vision-language models (VLMs), seeking to reveal their internal mechanisms through input interventions. While textual interventions often target se…

  49. arXiv cs.LG TIER_1 · Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, Liyiming Ke ·

    RL Token: Bootstrapping Online RL with Vision-Language-Action Models

    arXiv:2604.23073v1 Announce Type: new Abstract: Vision-language-action (VLA) models can learn to perform diverse manipulation skills "out of the box," but achieving the precision and speed that real-world tasks demand requires further fine-tuning -- for example, via reinforcement…

  50. arXiv cs.AI TIER_1 · Ziyao Wang, Bingying Wang, Hanrong Zhang, Tingting Du, Tianyang Chen, Guoheng Sun, Yexiao He, Zheyu Shen, Wanghao Ye, Ang Li ·

    Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

    arXiv:2604.23001v1 Announce Type: cross Abstract: Despite remarkable progress in Vision-Language-Action (VLA) models, a central bottleneck remains underexamined: the data infrastructure that underlies embodied learning. In this survey, we argue that future advances in VLA will …

  51. arXiv cs.CL TIER_1 · Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata ·

    Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    arXiv:2604.24380v1 Announce Type: new Abstract: While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction tec…

  52. Hugging Face Daily Papers TIER_1 ·

    CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade…

  53. Hugging Face Daily Papers TIER_1 ·

    Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

    Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modalit…

  54. arXiv cs.AI TIER_1 · Jinyu Gu ·

    Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    Vision-Language-Action (VLA) models are promising for generalist robot control, but on-robot deployment is bottlenecked by real-time inference under tight cost and energy budgets. Most prior evaluations rely on desktop-grade GPUs, obscuring the trade-offs and opportunities offere…

  55. Hugging Face Daily Papers TIER_1 ·

    Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from sm…

  56. arXiv cs.CL TIER_1 · Zeynep Akata ·

    Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from sm…

  57. arXiv cs.CL TIER_1 · Etha Tianze Hua, Tian Yun, Ellie Pavlick ·

    Source-Modality Monitoring in Vision-Language Models

    arXiv:2604.22038v1 Announce Type: new Abstract: We define and investigate source-modality monitoring -- the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of …

  58. Hugging Face Daily Papers TIER_1 ·

    LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens,…

  59. arXiv cs.CL TIER_1 · Ellie Pavlick ·

    Source-Modality Monitoring in Vision-Language Models

    We define and investigate source-modality monitoring -- the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate t…

  60. Hugging Face Daily Papers TIER_1 ·

    Prototype-Based Test-Time Adaptation of Vision-Language Models

    Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce …

  61. Hugging Face Daily Papers TIER_1 ·

    Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In th…

  62. Hugging Face Daily Papers TIER_1 ·

    More Than Meets the Eye: Measuring the Semiotic Gap in Vision-Language Models via Semantic Anchorage

    Vision-Language Models (VLMs) excel at photorealistic generation, yet often struggle to represent abstract meaning such as idiomatic interpretations of noun compounds. To study whether high visual fidelity interferes with idiomatic compositionality under visual abstraction, we in…

  63. arXiv cs.CV TIER_1 · Yangqiu Song ·

    Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

    Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insuff…

  64. arXiv cs.CV TIER_1 · Lei Zhang ·

    Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

    In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" m…

  65. arXiv cs.CV TIER_1 · Yadong Mu ·

    RotVLA: Rotational Latent Action for Vision-Language-Action Model

    Latent Action Models (LAMs) have emerged as an effective paradigm for handling heterogeneous datasets during Vision-Language-Action (VLA) model pretraining, offering a unified action space across embodiments. However, existing LAMs often rely on discrete quantization encode and d…

  66. arXiv cs.CV TIER_1 · Ting Cao ·

    GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models

    In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, vi…

  67. arXiv cs.CV TIER_1 · Stefano Peluchetti ·

    KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

    Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-t…

  68. arXiv cs.CV TIER_1 · Yukyung Choi ·

    CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

    In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. Howeve…

  69. arXiv cs.CV TIER_1 · Jingyuan Chen ·

    A₃B₂: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning

    Efficient transfer learning methods for large-scale vision-language models (e.g., CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has no…

  70. arXiv cs.CV TIER_1 · Siheng Chen ·

    Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    Vision-language-action (VLA) models are effective robot action executors, but they remain limited on long-horizon tasks due to the dual burden of extended closed-loop planning and diverse physical operations. We therefore propose VLAs-as-Tools, a strategy that distributes this bu…

  71. arXiv cs.CV TIER_1 · Sangdoo Yun ·

    Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation st…

  72. arXiv cs.CV TIER_1 · Miguel P. Eckstein ·

    Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

    Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate c…

  73. arXiv cs.CV TIER_1 · Zitong Yu ·

    Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

    Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large La…

  74. arXiv cs.CV TIER_1 · Alan Yuille ·

    LychSim: A Controllable and Interactive Simulation Framework for Vision Research

    While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical ba…

  75. arXiv cs.CV TIER_1 · Guanjun Jiang ·

    Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We id…

  76. arXiv cs.CV TIER_1 · Feng Dai ·

    VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

    Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses t…

  77. arXiv cs.CV TIER_1 · Yulun Zhang ·

    G²TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

    The development of separate-encoder Unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While…

  78. arXiv cs.CV TIER_1 · Xun Wang ·

    Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images…

  79. arXiv cs.CV TIER_1 · Zheng-Jun Zha ·

    Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent v…

  80. arXiv cs.CV TIER_1 · Philipp Johannes Schubert ·

    BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding

    Scene understanding is central to general physical intelligence, and video is a primary modality for capturing both state and temporal dynamics of a scene. Yet understanding physical processes remains difficult, as models must combine object localization, hand-object interactions…

  81. arXiv cs.CV TIER_1 · Fei Tian ·

    Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audi…

  82. arXiv cs.CV TIER_1 · Paolo Soda ·

    Resilient Vision-Tabular Multimodal Learning under Modality Missingness

    Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that r…

  83. arXiv cs.CV TIER_1 · Chenggang Yan ·

    Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models

    Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose clust…

  84. arXiv cs.CV TIER_1 · Dacheng Tao ·

    Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training

    Post-training with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enablin…

  85. arXiv cs.CV TIER_1 · Haoang Li ·

    CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

    This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can …

  86. arXiv cs.CV TIER_1 · Gang Pan ·

    ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models

    Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstr…

  87. arXiv cs.CV TIER_1 · Shen Li ·

    C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving

    Safety-critical planning in complex environments, particularly at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods, whether rule-based or data-driven, frequently struggle to capture complex scene semantics, infer potential risks, and m…

  88. arXiv cs.CV TIER_1 · Qingyao Wu ·

    Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

    During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention to compel focusing on key image information, we argue these tokens are critical carriers of visual an…

  89. arXiv cs.CV TIER_1 · Vasileios Mezaris ·

    LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

    Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for m…

  90. arXiv cs.CV TIER_1 · Wei He ·

    SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

    Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS…

  91. arXiv cs.CV TIER_1 · Wenzhao Zheng ·

    Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

    Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision m…

  92. arXiv cs.CV TIER_1 · Jinsong Su ·

    Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

    Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlyi…

  93. arXiv cs.CV TIER_1 · Plachetka Christopher ·

    Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

    Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of auto…

  94. arXiv cs.CV TIER_1 · Zhanyu Ma ·

    PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

    Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve …

  95. arXiv cs.CV TIER_1 · Cheng Deng ·

    Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models

    Class-incremental learning aims to continuously acquire new knowledge while preserving previously learned information, thereby mitigating catastrophic forgetting. Existing methods primarily restrict parameter updates but often overlook their structural properties in high-dimensio…

  96. arXiv cs.CV TIER_1 · Cheng Deng ·

    DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models

    Continual learning enables vision-language models to accumulate knowledge and adapt to evolving tasks without retraining from scratch. However, in multi-domain task-incremental learning, large domain shifts intensify the stability-plasticity dilemma. Most existing methods rely on…

  97. arXiv cs.CV TIER_1 · Zheng Li, Jerry Cheng, Huanying Helen Gu ·

    StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods

    arXiv:2604.04552v3 Announce Type: replace Abstract: Ensemble methods improve predictive performance but often incur high memory and computational costs. We identify an aggregation instability induced by nonlinear projection and voting operations. To address both efficiency challe…

  98. arXiv cs.CV TIER_1 Deutsch(DE) · Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin, Xiaoshuai Hao, Kun Wang, Wendong Wang ·

    Large Vision-Language Models Get Lost in Attention

    arXiv:2605.05668v1 Announce Type: cross Abstract: Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct…

  99. arXiv cs.CV TIER_1 · Jintao Sun, Gangyi Ding, Donglin Di, Hu Zhang, Zhedong Zheng ·

    Can Vision-Language Models Think from the Sky? Unifying UAV Reasoning and Generation

    arXiv:2604.05377v2 Announce Type: replace Abstract: Vision-Language Models have achieved strong progress in ground-view visual understanding, yet they remain brittle in high-altitude Unmanned Aerial Vehicle scenes, where objects are tiny and densely packed, textures are repetitiv…

  100. arXiv cs.CV TIER_1 · Zhenyu Wu ·

    DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

    Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensiti…

  101. arXiv cs.CV TIER_1 · Jiajin Guan (Research Institute of Electronic Science and Technology, University of Electronic Science and Technology of China, Chengdu, China), Haibo Mei (School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, C ·

    UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

    arXiv:2508.11196v2 Announce Type: replace Abstract: Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features hig…

  102. arXiv cs.CV TIER_1 · Yihan Lin, Haoyang Li, Yang Li, Haitao Shen, Yihan Zhao, Chao Shao, Jing Zhang ·

    From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    arXiv:2605.04678v1 Announce Type: cross Abstract: Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragment…

  103. arXiv cs.CV TIER_1 · Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Lei Huang, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin ·

    CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    arXiv:2605.04641v1 Announce Type: new Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce contents that deviate from visual information, leading to object hallucination. To tackle this, rece…

  104. arXiv cs.CV TIER_1 · Jing Zhang ·

    From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

    Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work str…

  105. arXiv cs.CV TIER_1 · Bing Qin ·

    CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce contents that deviate from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annot…

  106. arXiv cs.CV TIER_1 · Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein ·

    IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models

    arXiv:2602.16138v2 Announce Type: replace Abstract: We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 uniqu…

  107. arXiv cs.CV TIER_1 · Christian Rominger (University of Graz), Andreas R. Schwerdtfeger (University of Graz), Malay Gaherwar Singh (TU Dresden), Dimitri Khudyakow (TU Dresden), Elizabeth A. M. Michels (TU Dresden), Fabian Wolf (TU Dresden), Jakob Nikolas Kather (TU Dresden, Un ·

    Quantifying the human visual exposome with vision language models

    arXiv:2605.03863v1 Announce Type: cross Abstract: The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self reports, …

  108. arXiv cs.CV TIER_1 · Xiaowen Sun, Matthias Kerzel, Mengdi Li, Xufeng Zhao, Paul Striker, Stefan Wermter ·

    StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

    arXiv:2605.03927v1 Announce Type: new Abstract: Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject t…

  109. arXiv cs.CV TIER_1 · Kangkang Wang, Qinting Jiang, Wanping Zhang, Bowen Ren, Shengzhao Wen ·

    MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

    arXiv:2605.03485v1 Announce Type: new Abstract: Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric e…

  110. arXiv cs.CV TIER_1 · Yujun Li, Hongyuan Zhang, Yuan Yuan ·

    GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

    arXiv:2605.03403v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. It raises a question of whether the GRPO also significantly promotes the test-time ad…

  111. arXiv cs.CV TIER_1 · Sen Nie, Jie Zhang, Zhongqi Wang, Zhaoyang Wei, Shiguang Shan, Xilin Chen ·

    What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

    arXiv:2603.12799v2 Announce Type: replace Abstract: Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fu…

  112. arXiv cs.CV TIER_1 · JF Bastien, Sam D'Amico ·

    VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models

    arXiv:2605.03351v1 Announce Type: new Abstract: Video vision-language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study th…

  113. arXiv cs.CV TIER_1 · Stefan Wermter ·

    StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

    Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large lan…

  114. arXiv cs.CV TIER_1 · Yuan Yuan ·

    GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

    Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. It raises a question of whether the GRPO also significantly promotes the test-time adaptation (TTA) of vision language models. In thi…

  115. arXiv cs.CV TIER_1 · Sam D'Amico ·

    VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models

    Video vision-language models (VLMs) keep paying for visual state the stream already told us was stable. The factory wall did not move, but most VLM pipelines still hand the model dense RGB frames or a fresh prefix again. We study that waste as training-free anti-recomputation: re…

  116. arXiv cs.CV TIER_1 · Ümit Mert Çağlar, Alptekin Temizel ·

    Grounding Synthetic Data Generation With Vision and Language Models

    arXiv:2603.09625v2 Announce Type: replace Abstract: Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feat…

  117. arXiv cs.CV TIER_1 · Anjie Liu, Ziqin Gong, Yan Song, Yuxiang Chen, Xiaolong Liu, Hengtong Lu, Kaike Zhang, Chen Wei ·

    Active Reasoning Vision-Language Models via Sequential Experimental Design

    arXiv:2605.01345v1 Announce Type: new Abstract: Visual perception in modern Vision-Language Models (VLMs) is constrained by a fundamental perceptual bandwidth bottleneck: a broad field of view inevitably sacrifices the fine-grained details necessary for complex reasoning. Inspire…

  118. arXiv cs.CV TIER_1 · Yin Zhang, Jiaxuan Zhao, Zonghan Wu, Zengxiang Li, Junfeng Fang, Kun Wang, Qingsong Wen, Yilei Shao ·

    MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

    arXiv:2605.01520v1 Announce Type: new Abstract: Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising so…

  119. arXiv cs.CV TIER_1 · Zeshang Li, Shuoyang Zhang, Jiashen Ding ·

    GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models

    arXiv:2605.01733v1 Announce Type: new Abstract: Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rath…

  120. arXiv cs.CV TIER_1 · Yagiz Nalcakan, Hyeongjin Ju, Incheol Park, Sanghyeop Yeo, Youngwan Jin, Shiho Kim ·

    SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

    arXiv:2605.02258v1 Announce Type: new Abstract: Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and L…

  121. arXiv cs.CV TIER_1 · Zhou Bingtao, Xiang Mian, Ning Qian ·

    Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

    arXiv:2605.02604v1 Announce Type: new Abstract: Source-Free Domain Adaptation (SFDA) adapts source models to target domains without accessing source data, addressing privacy and transmission issues. However, existing methods still initialize from a source pre-trained model and th…

  122. arXiv cs.CV TIER_1 · Chenyu Hui, Xiaodi Huang, Siyu Xu, Yunke Wang, Shan You, Fei Wang, Tao Huang, Chang Xu ·

    Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    arXiv:2605.02757v1 Announce Type: new Abstract: Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limi…

  123. arXiv cs.CV TIER_1 (ET) · Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao ·

    jina-vlm: Small Multilingual Vision Language Model

    arXiv:2512.04032v3 Announce Type: replace-cross Abstract: We present jina-vlm, a token-efficient 2.4B parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 lang…

  124. arXiv cs.CV TIER_1 · Chang Xu ·

    Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

    Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak r…

  125. arXiv cs.CV TIER_1 · Ning Qian ·

    Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

    Source-Free Domain Adaptation (SFDA) adapts source models to target domains without accessing source data, addressing privacy and transmission issues. However, existing methods still initialize from a source pre-trained model and thus are not truly source-free. Recent works have …

  126. arXiv cs.CV TIER_1 · Shiho Kim ·

    SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

    Vision Foundation Models (VFMs) pretrained on large-scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near-Infrared (NIR), Short-Wave Infrared (SWIR), and Long-Wave Infrared (LWIR) remains largely unexplo…

  127. arXiv cs.CV TIER_1 · Jiayu Li, Jiaxin Qi, Sheng Zhou, Jiaqiang Huang, Xiansheng Hua ·

    Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

    arXiv:2605.00591v1 Announce Type: new Abstract: Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can ove…

  128. arXiv cs.CV TIER_1 · Minghui Chen, Chenxu Yang, Hengjie Zhu, Dayan Wu, Zheng Lin, Qingyi Si ·

    Online Self-Calibration Against Hallucination in Vision-Language Models

    arXiv:2605.00323v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from…

  129. arXiv cs.CV TIER_1 · Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei ·

    Let ViT Speak: Generative Language-Image Pre-training

    arXiv:2605.00809v1 Announce Type: new Abstract: In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language model…

  130. arXiv cs.CV TIER_1 · Aharon Azulay, Jan Dubi\'nski, Zhuoyun Li, Atharv Mittal, Yossi Gandelsman ·

    Jailbreaking Vision-Language Models Through the Visual Modality

    arXiv:2605.00583v1 Announce Type: new Abstract: The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual…

  131. arXiv cs.CV TIER_1 · Phuong Ngoc Nguyen, Kaito Shiku, Ryoma Bise, Seiichi Uchida, Shinnosuke Matsuo ·

    Leveraging Vision-Language Models as Weak Annotators in Active Learning

    arXiv:2605.00480v1 Announce Type: new Abstract: Active learning aims to reduce annotation cost by selectively querying informative samples for supervision under a limited labeling budget. In this work, we investigate how vision-language models (VLMs) can be leveraged to further r…

  132. arXiv cs.CV TIER_1 · Yunchao Wei ·

    Let ViT Speak: Generative Language-Image Pre-training

    In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with …

  133. arXiv cs.CV TIER_1 · Xiansheng Hua ·

    Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

    Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because…

  134. arXiv cs.CV TIER_1 · Yossi Gandelsman ·

    Jailbreaking Vision-Language Models Through the Visual Modality

    The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) re…

  135. arXiv cs.CV TIER_1 · Shinnosuke Matsuo ·

    Leveraging Vision-Language Models as Weak Annotators in Active Learning

    Active learning aims to reduce annotation cost by selectively querying informative samples for supervision under a limited labeling budget. In this work, we investigate how vision-language models (VLMs) can be leveraged to further reduce the reliance on costly human annotation wi…

  136. arXiv cs.CV TIER_1 · Mengling Deng, Yuanpeng Chen, Sheng Yang, Wei Tao, Wenhai Zhang, Hui Song, Linyuanhao Qin, Kai Zhao, Xiaojun Ye, Shanhui Mo, Jingli Fan, Shuang Zhang, Bei Liu, Tiankun Zhao, Xiangjing An ·

    EdgeFM: Efficient Edge Inference for Vision-Language Models

    arXiv:2604.27476v1 Announce Type: new Abstract: Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resourc…

  137. arXiv cs.CV TIER_1 · Hyeonseo Jang, Jaebyeong Jeon, Joong-Won Hwang, Kibok Lee ·

    Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

    arXiv:2604.27715v1 Announce Type: new Abstract: Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often…
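
    For context on the baseline this paper builds on, here is a minimal sketch of the generic test-time prompt tuning loop (entropy minimization over augmented views of a single test image). The calibration-oriented, data-free pretraining proposed in the paper is not shown, and `model` is a hypothetical callable returning class logits from a frozen VLM given learnable prompt embeddings:

    ```python
    # Generic TPT step, illustrative only: minimize the entropy of the marginal
    # prediction over augmented views by updating only the prompt embeddings.
    import torch

    def tpt_step(model, prompt: torch.nn.Parameter, views: torch.Tensor,
                 lr: float = 5e-3) -> None:
        """views: (N_aug, C, H, W) augmented crops of one test image."""
        optimizer = torch.optim.AdamW([prompt], lr=lr)
        logits = model(views, prompt)                  # hypothetical: (N_aug, num_classes)
        probs = logits.softmax(dim=-1).mean(dim=0)     # marginal distribution over views
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    ```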

  138. arXiv cs.CV TIER_1 · Pengna Li, Kangyi Wu, Shaoqing Xu, Fang Li, Hanbing Li, Lin Zhao, Kailin Lyu, Long Chen, Zhi-Xin Yang, Nanning Zheng ·

    SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    arXiv:2604.27620v1 Announce Type: new Abstract: Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them wit…

  139. arXiv cs.CV TIER_1 · Hong-Tao Yu, Yuxin Peng, Serge Belongie, Xiu-Shen Wei ·

    Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

    arXiv:2504.14988v4 Announce Type: replace Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both h…

  140. arXiv cs.CV TIER_1 · Mingliang Liang, Zhuoran Liu, Arjen P. de Vries, Martha Larson ·

    Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

    arXiv:2604.27932v1 Announce Type: new Abstract: The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distrib…
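
    As a rough, hedged illustration of the semantic-balance idea mentioned here (not the paper's dynamic sampling scheme), capping how many examples any single semantic cluster contributes keeps head clusters from crowding out long-tail ones:

    ```python
    # Static cluster-balanced subsampling, illustrative only.
    import random
    from collections import defaultdict
    from typing import Dict, Hashable, List

    def balanced_sample(cluster_ids: List[Hashable], per_cluster_cap: int,
                        seed: int = 0) -> List[int]:
        """cluster_ids[i] is the semantic cluster of training example i; returns kept indices."""
        rng = random.Random(seed)
        by_cluster: Dict[Hashable, List[int]] = defaultdict(list)
        for index, cluster in enumerate(cluster_ids):
            by_cluster[cluster].append(index)
        kept: List[int] = []
        for members in by_cluster.values():
            rng.shuffle(members)
            kept.extend(members[:per_cluster_cap])  # cap each cluster's contribution
        return kept
    ```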

  141. arXiv cs.CV TIER_1 · Qingyi Si ·

    Online Self-Calibration Against Hallucination in Vision-Language Models

    Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offl…

  142. arXiv cs.CV TIER_1 · Kenneth J. K. Ong ·

    The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

    As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effects of visual priming on VLMs' cooperative behavior using the Iterated Prisoner'…

  143. arXiv cs.CV TIER_1 · Martha Larson ·

    Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

    The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accur…

  144. arXiv cs.CV TIER_1 · Kibok Lee ·

    Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

    Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often produces poorly calibrated models, raising conc…

  145. arXiv cs.CV TIER_1 · Nanning Zheng ·

    SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation

    Vision-and-Language Navigation (VLN) aims to enable an embodied agent to follow natural-language instructions and navigate to a target location in unseen 3D environments. We argue that adapting VLMs to VLN requires endowing them with two complementary capabilities for acquiring s…

  146. arXiv cs.CV TIER_1 · Xiangjing An ·

    EdgeFM: Efficient Edge Inference for Vision-Language Models

    Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely o…

  147. arXiv cs.CV TIER_1 · Junwon You, Mihyun Jang, Sangwoo Mo, Jae-Hun Jung ·

    Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning

    arXiv:2604.26370v1 Announce Type: new Abstract: Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text …

  148. arXiv cs.CV TIER_1 · Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang ·

    FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

    arXiv:2504.09925v3 Announce Type: replace Abstract: We introduce FLARE, a family of vision language models (VLMs) with a fully vision-language alignment and integration paradigm. Unlike existing approaches that rely on single MLP projectors for modality alignment and defer cross-…

  149. arXiv cs.CV TIER_1 · Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng, Chrysa Papagianni ·

    Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

    arXiv:2604.26508v1 Announce Type: cross Abstract: Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully off…

  150. arXiv cs.CV TIER_1 · Junru Song, Yimeng Hu, Yijing Chen, Huining Li, Qian Li, Lizhen Cui, Yuntao Du ·

    Delineating Knowledge Boundaries for Honest Large Vision-Language Models

    arXiv:2604.26419v1 Announce Type: new Abstract: Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to re…

  151. arXiv cs.CV TIER_1 · Chrysa Papagianni ·

    Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

    Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractica…

  152. arXiv cs.CV TIER_1 · Yuntao Du ·

    Delineating Knowledge Boundaries for Honest Large Vision-Language Models

    Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to refuse queries that exceed their parametric knowle…

  153. arXiv cs.CV TIER_1 · Jae-Hun Jung ·

    Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning

    Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text pairs together with abundant unlabeled images, e…

  154. arXiv cs.CV TIER_1 · Soroush Seifi, Vaggelis Dorovatas, Matteo Cassinelli, Fabien Despinoy, Daniel Olmeda Reino, Rahaf Aljundi ·

    Personalization Toolkit: Training Free Personalization of Large Vision Language Models

    arXiv:2502.02452v4 Announce Type: replace Abstract: Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users or object instances and to generate contextually tailored responses. Existing approaches rely on time-consuming trai…

  155. arXiv cs.CV TIER_1 · Yashwant Pravinrao Bangde, Debaditya Roy ·

    Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

    arXiv:2604.25809v1 Announce Type: new Abstract: Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have sh…
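
    The title suggests a dual-stream decoding scheme; as a generic, hedged sketch of contrastive decoding in that spirit (the paper's exact formulation may differ), one can boost tokens preferred by an image-conditioned stream over an image-free stream, so generations stay anchored to visual evidence:

    ```python
    # Generic contrastive decoding step, illustrative only.
    import torch

    def contrastive_next_token(logits_with_image: torch.Tensor,
                               logits_text_only: torch.Tensor,
                               alpha: float = 1.0) -> int:
        """Both inputs: (vocab_size,) next-token logits from the two decoding streams."""
        contrasted = (1 + alpha) * logits_with_image - alpha * logits_text_only
        return int(torch.argmax(contrasted).item())
    ```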

  156. arXiv cs.CV TIER_1 · Debaditya Roy ·

    Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

    Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have shown that instruction prompting further worsens t…

  157. arXiv cs.CV TIER_1 · Tairan Fu, Francisco Javier Santos-Martín, Javier Conde, Pedro Reviriego, Elena Merino-Gómez ·

    Lost in the Vibrations: Vision Language Models Fail the Dynamic Gauges Test

    arXiv:2604.22829v1 Announce Type: new Abstract: The digital transformation of industrial manufacturing increasingly relies on the ability of autonomous robots to interact with legacy infrastructure, particularly analog gauges. While Vision-Language Models (VLMs) have demonstrated…

  158. arXiv cs.CV TIER_1 · Brandon Collins, Logan Bolton, Hung Huy Nguyen, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen ·

    SketchVLM: Vision language models can annotate images to explain thoughts and guide users

    arXiv:2604.22875v1 Announce Type: new Abstract: When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficu…

  159. arXiv cs.CV TIER_1 · Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott ·

    Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

    arXiv:2604.14888v2 Announce Type: replace-cross Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instructio…

  160. arXiv cs.CV TIER_1 · Ashwin Kumar, Robbie Holland, Corey Barrett, Jangwon Kim, Maya Varma, Zhihong Chen, Yunhe Gao, Greg Zaharchuk, Tara Taghavi, Krishnaram Kenthapadi, Akshay Chaudhari ·

    CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

    arXiv:2604.22989v1 Announce Type: new Abstract: Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer…

  161. arXiv cs.CV TIER_1 · Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He ·

    Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

    arXiv:2604.21728v2 Announce Type: replace Abstract: Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labe…

  162. arXiv cs.CV TIER_1 · Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Liwei Zhang, Weihao Yuan, Siyu Zhu ·

    BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

    arXiv:2604.16514v4 Announce Type: replace Abstract: Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directl…

  163. arXiv cs.CV TIER_1 · Shaotian Li, Shangze Li, Chuancheng Shi, Wenhua Wu, Yanqiu Wu, Xiaohan Yu, Fei Shen, Tat-Seng Chua ·

    Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models

    arXiv:2604.07802v2 Announce Type: replace Abstract: Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current methods predominantly treat VLMs…

  164. arXiv cs.CV TIER_1 Deutsch(DE) · Kanishk Jain, Qian Yang, Shravan Nayak, Parisa Kordjamshidi, Nishanth Anand, Aishwarya Agrawal ·

    Discovering Failure Modes in Vision-Language Models using RL

    arXiv:2604.04733v2 Announce Type: replace Abstract: Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoi…

  165. arXiv cs.CV TIER_1 · Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva ·

    LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

    arXiv:2604.00829v3 Announce Type: replace Abstract: Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such l…

  166. arXiv cs.CV TIER_1 · Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park ·

    Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

    arXiv:2603.19482v2 Announce Type: replace Abstract: Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curate…

  167. arXiv cs.CV TIER_1 · Soumyaratna Debnath, Bui Duc Manh, Zinan Liu, Lin Wang ·

    LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

    arXiv:2603.14882v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither unifo…

  168. arXiv cs.CV TIER_1 Italiano(IT) · Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim ·

    Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

    arXiv:2512.10362v2 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop s…

  169. arXiv cs.CV TIER_1 · Zikun Guo, Jingwei Lv, Xinyue Xu, Shu Yang, Jun Wen, Di Wang, Lijie Hu ·

    Benchmarking and Mitigating Sycophancy in Medical Vision Language Models

    arXiv:2509.21979v4 Announce Type: replace Abstract: Visual language models (VLMs) have the potential to transform medical workflows. However, the deployment is limited by sycophancy. Despite this serious threat to patient safety, a systematic benchmark remains lacking. This paper…

  170. arXiv cs.CV TIER_1 · Zongxia Li, Wenhao Yu, Chengsong Huang, Zhenwen Liang, Rui Liu, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu ·

    Self-Rewarding Vision-Language Model via Reasoning Decomposition

    arXiv:2508.19652v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) often suffer from visual hallucinations: generating things that are not consistent with visual inputs and language shortcuts, where they skip the visual part and just rely on text priors. These issu…

  171. arXiv cs.CV TIER_1 · Hanqi Yan, Xiangxiang Cui, Lu Yin, Jindong Gu, Paul Pu Liang, Yulan He, Yifei Wang ·

    Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models

    arXiv:2502.14888v4 Announce Type: replace Abstract: The success of vision-language models is primarily attributed to effective alignment across modalities such as vision and language. However, modality gaps persist in existing alignment algorithms and appear necessary for human p…

  172. arXiv cs.CV TIER_1 (CA) · Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishn ·

    NVILA: Efficient Frontier Visual Language Models

    arXiv:2412.04468v3 Announce Type: replace Abstract: Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimiz…

  173. arXiv cs.CV TIER_1 · Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang, Weinong Wang, Yu Guo, Bin Qian, Zhihai He ·

    CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    arXiv:2604.24622v1 Announce Type: new Abstract: Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian n…

  174. arXiv cs.CV TIER_1 · Lixian Chen, Mingxuan Huang, Yanhui Chen, Junyi Lin, Yang Shi ·

    Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

    arXiv:2604.24602v1 Announce Type: new Abstract: Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while…

  175. arXiv cs.CV TIER_1 · Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang, Kaiwen Long ·

    LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    arXiv:2604.23950v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address …
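
    For orientation, a simplified sketch of the attention-based visual token pruning that this paper rethinks (LearnPruner's own learned criterion is not shown): rank visual tokens by the attention they receive from query/text tokens and keep only the top fraction:

    ```python
    # Attention-score token pruning, illustrative only.
    import torch

    def prune_visual_tokens(visual_tokens: torch.Tensor,
                            attn_to_visual: torch.Tensor,
                            keep_ratio: float = 0.25) -> torch.Tensor:
        """
        visual_tokens:  (num_visual, dim) visual token embeddings.
        attn_to_visual: (num_queries, num_visual) attention from query/text tokens.
        """
        scores = attn_to_visual.mean(dim=0)                       # (num_visual,)
        keep = max(1, int(keep_ratio * visual_tokens.shape[0]))
        top_idx = torch.topk(scores, keep).indices.sort().values  # preserve original order
        return visual_tokens[top_idx]
    ```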

  176. arXiv cs.CV TIER_1 · Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng, Yifan Du, Wayne Xin Zhao, Min Yang, Ji-Rong Wen ·

    Improving Vision-language Models with Perception-centric Process Reward Models

    arXiv:2604.24583v1 Announce Type: new Abstract: Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnos…

  177. arXiv cs.CV TIER_1 · Zhihai He ·

    CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade…

  178. arXiv cs.CV TIER_1 · Yang Shi ·

    Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

    Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modalit…

  179. arXiv cs.CV TIER_1 · Ji-Rong Wen ·

    Improving Vision-language Models with Perception-centric Process Reward Models

    Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain.…

  180. arXiv cs.CV TIER_1 · Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang ·

    V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

    arXiv:2504.06148v3 Announce Type: replace Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual-text processing. However, existing static image-text benchmarks are insufficient for evaluating their dynamic pe…

  181. arXiv cs.CV TIER_1 · Kaiwen Long ·

    LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens,…

  182. arXiv cs.CV TIER_1 · Jingrui He ·

    Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

    Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. …

  183. arXiv cs.CV TIER_1 · Mitesh M. Khapra ·

    Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

    Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains…

  184. arXiv cs.CV TIER_1 · Rongrong Ji ·

    Prototype-Based Test-Time Adaptation of Vision-Language Models

    Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce …

  185. dev.to — Anthropic tag TIER_1 · Jangwook Kim ·

    Claude Opus 4.7: High-Res Vision, Task Budgets, and Agentic Coding

    Anthropic released Claude Opus 4.7 on April 16, 2026. Three things make this release worth paying attention to if you were on Opus 4.6 and wondering whether it was time to upgrade: a significant jump in image resolution support, a new task budget mechanism for agentic loops, a…

  186. HN — machine learning stories TIER_1 · 2bit ·

    FastVLM: Efficient Vision Encoding for Vision Language Models

  187. dev.to — LLM tag TIER_1 · 丁久 ·

    Multimodal AI Models: Vision, Audio, and Text

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/multimodal-models.html). For the full version with working code examples and related articles, visit the original post.

  188. dev.to — LLM tag TIER_1 · 丁久 ·

    Building Multimodal AI Applications: Vision, Audio, and Text Combined (2026)

    This article was originally published on AI Study Room (https://dingjiu1989-hue.github.io/en/ai/multimodal-ai-guide.html). For the full version with working code examples and related articles, visit the original post.

  189. r/MachineLearning TIER_1 · /u/Nice-Dragonfly-4823 ·

    How Visual-Language-Action (VLA) Models Work [D]

    Image post linking to the discussion: https://www.reddit.com/r/MachineLearning/comments/1svhwtz/how_visuallanguageaction_vla_models_work_d/