VisReflect Framework Improves Fine-Grained Perception of LVLMs in Long Visual Contexts

By the PulseAugur editorial team · [2 sources] · 2026-06-29 13:30

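Researchers have introduced VisReflect, a new framework intended to improve the fine-grained perception of large vision language models (LVLMs) on high-resolution images and long videos. The method targets the "visual attention sink phenomenon," in which irrelevant visual tokens dominate the model's attention. VisReflect uses latent visual reflection to steer attention toward salient regions or frames within a single forward pass, avoiding the computational cost of re-encoding cropped visual regions. Evaluations on BLINK, HRBench-4K/8K, MVBench, VideoMME, and MLVU show significant performance gains: 4.1% on image tasks and 1.8% on video tasks. Inference time for video understanding also drops by about 44% compared with existing zoom-based methods.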
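To make the attention sink idea concrete, here is a minimal toy sketch in plain Python. It is not the VisReflect method, whose implementation this summary does not describe. The logits, the token indices, and the `refocus` helper are all hypothetical, chosen only to show how a few high-logit tokens can absorb most of the softmax attention mass and how a bias toward salient tokens shifts that mass.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass(weights, indices):
    """Share of total attention that falls on the given token indices."""
    return sum(weights[i] for i in indices)


def refocus(logits, salient, bias):
    """Hypothetical re-weighting: add a constant bias to salient-token logits.

    This is an illustrative stand-in, not the mechanism used by VisReflect.
    """
    salient_set = set(salient)
    return [x + bias if i in salient_set else x for i, x in enumerate(logits)]


# Toy attention logits for 8 visual tokens (made-up values).
# Tokens 0 and 1 play the role of "sink" tokens with large logits;
# tokens 5 and 6 play the role of the tokens that matter for the question.
logits = [6.0, 5.5, 1.0, 1.0, 1.0, 2.0, 2.0, 1.0]
sink, salient = [0, 1], [5, 6]

weights = softmax(logits)
sink_mass = attention_mass(weights, sink)
salient_mass = attention_mass(weights, salient)

refocused = softmax(refocus(logits, salient, bias=5.0))
refocused_salient_mass = attention_mass(refocused, salient)

print(f"sink mass before:    {sink_mass:.3f}")
print(f"salient mass before: {salient_mass:.3f}")
print(f"salient mass after:  {refocused_salient_mass:.3f}")
```

In this toy setup the two sink tokens take nearly all of the attention before re-weighting, and the salient tokens take the majority afterward. The re-weighting happens on the logits of one attention computation, which loosely mirrors the single-pass framing in the summary, but the real method may work quite differently.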

Impact: Strengthens the fine-grained perception of LVLMs on complex visual tasks, which could improve image and video analysis applications.

Ranking rationale: This cluster describes a research paper detailing a new framework for improving AI model performance.


AI-generated summary · Google Gemini · from 2 sources.

Coverage sources [2]

arXiv cs.CV TIER_1 English(EN) · Xiaoqian Shen, Mohamed Elhoseiny · 2026-06-30 04:00

VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

arXiv:2606.30288v1 Announce Type: new Abstract: Large Vision Language Models (LVLMs) have achieved remarkable success on vision-language tasks, yet fine-grained perception over high-resolution images and long-context videos remains challenging. As the number of visual tokens incr…

arXiv cs.CV TIER_1 English(EN) · Mohamed Elhoseiny · 2026-06-29 13:30

VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

Large Vision Language Models (LVLMs) have achieved remarkable success on vision-language tasks, yet fine-grained perception over high-resolution images and long-context videos remains challenging. As the number of visual tokens increases, the visual attention sink phenomenon beco…
