Large Vision Language Models
PulseAugur coverage of Large Vision Language Models — every cluster mentioning Large Vision Language Models across labs, papers, and developer communities, ranked by signal.
5 days with sentiment data
-
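New decoding method tackles hallucination in vision-language models
Researchers have developed CHASd, a new inference-time framework for countering hallucinations in Large Vision Language Models (LVLMs). The method, Contrastive Hallucination-Aware Step-wise Decoding, activates a contrastive decoding branch only when token prediction confidence is low. It uses attention-guided local visual perturbations to minimize interference with useful visual evidence, improving hallucination metrics on multiple benchmarks while keeping inference efficient.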
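The confidence-gated idea in the CHASd summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `selective_contrastive_step`, the `threshold` and `alpha` parameters, and the contrast formula `(1 + alpha) * full - alpha * perturbed` (borrowed from generic visual contrastive decoding) are all assumptions made for illustration, and the attention-guided perturbation itself is not modeled. The sketch only shows how a contrastive branch might be skipped when the model is already confident.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def selective_contrastive_step(logits_full, logits_perturbed,
                               threshold=0.5, alpha=1.0):
    """Pick the next token id for one decoding step (hypothetical helper).

    logits_full:      logits conditioned on the original image
    logits_perturbed: logits conditioned on a perturbed image
                      (assumed to be supplied by the caller)
    threshold, alpha: illustrative hyperparameters, not from the source

    Returns (token_id, used_contrast).
    """
    probs = softmax(logits_full)
    confidence = max(probs)
    if confidence >= threshold:
        # Confident prediction: skip the contrastive branch entirely.
        return probs.index(confidence), False
    # Low confidence: favor tokens whose score drops when the image
    # evidence is perturbed (assumed generic contrast formula).
    contrasted = [(1 + alpha) * f - alpha * p
                  for f, p in zip(logits_full, logits_perturbed)]
    return contrasted.index(max(contrasted)), True


# Confident step: the contrastive branch is not used.
print(selective_contrastive_step([5.0, 0.0, 0.0], [5.0, 0.0, 0.0]))
# Uncertain step: the contrast shifts the choice toward token 1.
print(selective_contrastive_step([1.0, 0.9, 0.0], [1.0, 0.2, 0.0]))
```

Gating on confidence means the second forward pass on the perturbed image is only needed for uncertain steps, which is one plausible reading of how such a method keeps inference efficient.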
-
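New COCOTree dataset supports hierarchical visual decomposition
Researchers have introduced COCOTree, a new dataset and benchmark for the open tree-structured visual decomposition task, which involves segmenting an image into a hierarchical tree of visual components with flexible granularity. The dataset was generated with a novel pipeline that combines large vision-language models and SAM 3 for semantic reasoning and geometric grounding, yielding more than 2.1K images and 1.8M structural nodes with an open vocabulary of 3.5K labels. A new evaluation metric, Open Tree Quality (OTQ), is also proposed to assess mask precision, label accuracy, and structural consistency.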
-
New MedFocus method improves LVLM visual attribution for medical imaging
Researchers have developed a new framework to evaluate how well Large Vision Language Models (LVLMs) can ground their reasoning in visual evidence, particularly for chest X-ray analysis. Existing attribution methods oft…
-
SetCon advances referring segmentation with set-level concept prediction
Researchers have introduced SetCon, a novel approach to open-ended referring segmentation that treats multiple targets as a coherent set rather than individual outputs. This method reformulates the problem as explicit s…
-
SplitQ framework enhances low-bit quantization for vision-language models
Researchers have developed SplitQ, a new post-training quantization framework designed to improve the efficiency of large vision-language models (VLMs) on devices with limited resources. SplitQ addresses the accuracy de…
-
New attack manipulates vision-language models via image-only prompts
Researchers have developed a new cross-modal prompt injection attack called CrossMPI that can manipulate the interpretation of both text and image inputs in large vision-language models (LVLMs) through image-only pertur…
-
New method enhances vision-language models with group revision
Researchers have introduced a new group-revision optimization paradigm to improve object-level grounding in large vision-language models. This method addresses the limitations of sparse, response-level rewards in existi…
-
SteerSeg framework improves video segmentation using steered attention maps
Researchers have developed SteerSeg, a new framework designed to improve video segmentation by addressing issues with attention maps generated by large vision-language models. These models often produce diffuse or ambig…
-
New framework estimates LVLM confidence by contrasting image-based predictions
Researchers have developed a new framework called BICR (Blind-Image Contrastive Ranking) to assess the confidence of Large Vision-Language Models (LVLMs). This method helps distinguish between predictions genuinely info…
-
Composer framework advances aesthetic image generation via composition transfer
Researchers have developed Composer, a new framework designed to improve the aesthetic quality of generated images by explicitly modeling composition. This approach separates composition from semantics, allowing for com…
-
New VIDA dataset tackles ambiguity in multimodal machine translation
Researchers have introduced VIDA, a new dataset designed to tackle ambiguity in multimodal machine translation. The dataset contains 2,500 instances where visual context is crucial for resolving ambiguous expressions. E…