Vision Language Models
PulseAugur coverage of Vision Language Models — every cluster mentioning Vision Language Models across labs, papers, and developer communities, ranked by signal.
6 days with sentiment data
-
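VisAnalog suite tests visual concept transfer in AI models
Researchers have introduced VisAnalog, a new diagnostic suite designed to evaluate how well vision models transfer concepts across different images and transformations. The benchmark contains 617 human-verified questions that test a model's ability to identify and manipulate visual attributes through steps such as rotations, flips, and color changes. Initial tests of various vision-language models show markedly lower accuracy than human performance, especially as transformation complexity increases, suggesting that relational reasoning is the main bottleneck.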
-
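Large language models (LLMs) merge data and control planes, creating new security risks
Large language models (LLMs) inherently blur the boundary between data and control, posing significant security challenges for infrastructure engineers and ML operations staff. Unlike traditional computing, LLMs lack a distinct data plane, which means everything in the context window, whether a prompt, a document, or instructions hidden in an image, is treated as an executable command. This architectural flaw lets untrusted artifacts influence model behavior, potentially leading to vulnerabilities such as bypassing database security or altering engineering calculations.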
-
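Study finds vision-language models struggle with spatial-numerical understanding
A new research framework called SpaceNum has been developed to evaluate how well vision-language models (VLMs) understand spatial-numerical concepts. The study finds that current VLMs largely fail to connect numerical outputs with spatial perception, often performing at chance level. The models tend to rely on superficial spatial cues and struggle with coordinate-aware representations and with abstracting structured layouts from visual data.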
-
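New DDX-TRACE benchmark evaluates the medical diagnostic trajectories of vision-language models
Researchers have introduced DDX-TRACE, a new benchmark designed to evaluate the diagnostic reasoning of vision-language models (VLMs) in medical settings. Unlike existing benchmarks that focus only on the final answer, DDX-TRACE evaluates the entire diagnostic trajectory, including how a model requests evidence, updates its differential diagnosis, and manages uncertainty across successive steps. Initial evaluations of state-of-the-art VLMs reveal significant shortcomings, showing that models can score highly on the final diagnosis without demonstrating sound clinical reasoning or efficient evidence gathering.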
-
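New framework improves VLM anomaly detection in autonomous vehicles
Researchers have developed SAVANT, a new framework designed to use vision-language models (VLMs) to improve semantic anomaly detection in autonomous driving systems. SAVANT reframes anomaly detection as layered semantic consistency verification, strengthening the ability of existing VLMs to identify rare, out-of-distribution driving scenarios. Compared with standard prompting approaches, the framework improves recall by about 18.5% and enables automatic labeling of roughly 10,000 real-world images. Using this curated dataset, a fine-tuned 7B open-source model achieves 90.8% recall and 93.8% accuracy in single-shot anomaly detection, offering … for the field's data scarcity problem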
-
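New research tackles continual learning in large language models with novel mixture-of-experts approaches
Two new research papers propose novel approaches to continual learning in large language models and vision-language models, aiming to mitigate catastrophic forgetting. CP-MoE introduces a transient expert to guide updates and retain knowledge, while MoRAM uses fine-grained rank-1 adapters as memory units to enable content-addressable retrieval. Compared with existing mixture-of-experts techniques, both methods show improved benchmark performance and a better trade-off between plasticity and stability.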
-
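New benchmark tests AI navigation from implicit human intentions
Researchers have introduced IntentionNav, a new benchmark designed to test the ability of embodied AI agents to navigate and find objects based on implicit human instructions. Unlike earlier benchmarks that specify the target object, IntentionNav requires agents to infer the object from a free-text intention, such as needing something to heat food. The benchmark contains 500 intentions across 176 simulated scenes, and evaluations show that current models struggle with both target inference and successful task completion, highlighting indirect human intent as a significant bottleneck.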
-
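New research benchmarks and improves gaze understanding in VLMs
Researchers have developed new methods to evaluate and improve how vision-language models (VLMs) understand human gaze. One study introduces EyeVLM, a framework for benchmarking VLMs on gaze following and social gaze prediction, and finds that current models lack precise understanding. Another paper proposes a novel training scheme that uses local LoRA and an out-of-view-cone penalty to strengthen the gaze reasoning of vision foundation models on gaze-following tasks, achieving state-of-the-art results.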
-
Open-source VLMs evaluated for grocery product retrieval accuracy
A new paper evaluates 190 open-source vision-language models (VLMs) on the task of grocery product retrieval, a crucial component for checkout-free retail. The research found that data quality is more important than mod…
-
Vision-language models fail at basic path following tasks
Researchers have identified a significant failure mode in vision-language models (VLMs) related to visual path following. Even advanced VLMs struggle to consistently trace a designated path, frequently switching to near…
-
Vision Mamba models show promise for AI-generated image detection
A new research paper investigates the effectiveness of Vision Mamba models in detecting AI-generated images. The study systematically evaluates various Vision Mamba architectures against established methods like CNNs, V…
-
MolDeTox benchmark evaluates LLMs for molecular detoxification in drug discovery
Researchers have introduced MolDeTox, a new benchmark designed to evaluate the capabilities of large language models (LLMs) and vision-language models (VLMs) in molecular detoxification. This benchmark addresses limitat…
-
GridProbe cuts VLM compute cost for long videos
Researchers have developed GridProbe, a novel method to improve the efficiency of long-video Visual Language Models (VLMs). This technique adaptively selects relevant frames during inference, reducing the computational …
-
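AI-generated videos fool models but not humans, new benchmark shows
Researchers have introduced VideoASMR-Bench, a new benchmark designed to evaluate how well AI models distinguish real ASMR (autonomous sensory meridian response) videos from AI-generated ones. The benchmark includes a dataset of real ASMR videos paired with synthetic counterparts generated by various models, along with an evaluation framework that pits video generation models against video understanding models in an adversarial game. Current state-of-the-art models, including Google's Gemini-3-Pro, struggle to reliably detect AI-generated ASMR content, pointing to a gap in fine-grained audiovisual perception.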
-
GeoStack framework enables efficient VLM knowledge composition, preventing catastrophic forgetting
Researchers have developed GeoStack, a novel framework designed to enhance knowledge composition in Vision-Language Models (VLMs). This approach addresses the issue of catastrophic forgetting, where models lose previous…
-
SpecPL paper introduces spectral granularity for prompt learning in VLMs
Researchers have introduced SpecPL, a novel approach to prompt learning for Vision-Language Models (VLMs) that addresses modality asymmetry by focusing on spectral granularity. This method decomposes visual signals into…
-
BareBones benchmark reveals Vision-Language Models suffer texture bias cliff
Researchers have introduced BareBones, a new benchmark designed to test the geometric comprehension abilities of Vision-Language Models (VLMs). The benchmark uses pixel-level silhouettes to evaluate if VLMs can understa…
-
VISTA benchmark launched for advanced VLM spatio-temporal interaction analysis
Researchers have introduced VISTA, a new benchmark designed to evaluate the spatio-temporal understanding capabilities of Vision-Language Models (VLMs). Unlike existing benchmarks that focus on simple actions and limite…
-
Researchers propose Gromov-Wasserstein distance for VLM vision encoder selection
Researchers have developed a new method for selecting optimal vision encoders for Vision-Language Models (VLMs). Traditional approaches, like choosing encoders with high accuracy or large size, were found to be ineffect…
-
New framework enhances multimodal in-context learning with inductive-deductive reasoning
Researchers have developed a new framework to improve in-context learning for vision-language models (VLMs). The approach addresses an "inductive gap" where models may reach correct answers through flawed reasoning and …