PulseAugur
Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

New research examines how susceptible VLMs are to visual persuasion and influence

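Researchers are developing new frameworks to evaluate how susceptible vision-language models (VLMs) are to multimodal persuasion and visual influence. One study introduces MMPersuade, which uses images and psychological strategies to test persuasion between agents. It finds that multimodal inputs are more persuasive than text alone, and that susceptibility varies by domain and model architecture. A second paper proposes a method that systematically perturbs images and analyzes how VLMs' visual preferences shift in response, with the aim of exposing vulnerabilities and improving auditing. A third study focuses on vision-language-action (VLA) models for autonomous driving, using a perturbation framework to understand how visual information grounds driving behavior and to inform the development of safer systems.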

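Impact: These studies highlight critical vulnerabilities in multimodal AI systems and inform the development of more robust, trustworthy AI agents across a range of applications.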

Ranking rationale: Several arXiv papers introduce new frameworks and analyses for evaluating the behavior of VLM and VLA models.


AI-generated summary · Google Gemini · Based on 4 sources.

Sources [4]

  1. arXiv cs.CL TIER_1 · Haoyi Qiu, Yilun Zhou, Pranav Narayanan Venkit, Kung-Hsiang Huang, Jiaxin Zhang, Nanyun Peng, Chien-Sheng Wu

    Seeing Is Believing? Evaluating the Susceptibility of Vision-Language Models to Multimodal Agent-to-Agent Persuasion

    arXiv:2510.22768v2 Announce Type: replace Abstract: As autonomous agents increasingly interact, they inevitably attempt to influence one another. While prior work in text-only settings has explored the dynamics of Agent-to-Agent (A2A) persuasion, the rise of Vision-Language Model…

  2. arXiv cs.AI TIER_1 · Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh

    Visual Persuasion: What Influences the Decisions of Vision-Language Models?

    arXiv:2602.15278v2 Announce Type: replace-cross Abstract: The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recom…

  3. arXiv cs.AI TIER_1 · Jingtao He, Hongliang Lu, Xiaoyun Qiu, Yixuan Wang, Xinhu Zheng

    Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

    arXiv:2605.31041v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified multimodal architectures for jointly modeling perception and planning. However, how current VL…

  4. arXiv cs.CV TIER_1 · Xinhu Zheng

    Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

    Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified multimodal architectures for jointly modeling perception and planning. However, how current VLA-based driving behavior is grounded in visual inf…