PulseAugur

vision-language model

PulseAugur coverage of vision-language models: every cluster mentioning vision-language models across labs, papers, and developer communities, ranked by signal.

Total · 30 days: 111 (90 days: 111)
Releases · 30 days: 0 (90 days: 0)
Papers · 30 days: 107 (90 days: 107)
Timeline
  1. 2026-05-19 research_milestone A new method is proposed to improve out-of-distribution visual document understanding in VLMs.
Sentiment · 30 days

17 days with sentiment data

Recent · page 5 of 6 · 111 items
  1. RESEARCH · CL_22022 ·

    DexSim2Real uses foundation models to bridge sim-to-real gap in robotics

    Researchers have developed DexSim2Real, a new framework that uses foundation models to improve the transfer of robotic manipulation skills from simulation to the real world. The system incorporates a vision-language mod…

  2. RESEARCH · CL_13548 ·

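    AI developments span XQuery conversion, an OCR pipeline, and China's benchmark challenge

    A new open-source pipeline called SGOCR 2026 has been released to generate spatially aware OCR datasets for training vision-language models (VLMs). The pipeline aims to separate text localization from semantic reasoning, filling a gap in current VLM training data. Separately, there is an ongoing discussion about using local LLMs to convert XQuery to SQL, and about whether fine-tuning is needed or whether a hybrid of parsing and prompt engineering is sufficient. In addition, AI progress in China, particularly from DeepSeek, is challenging claims of US leadership in the field, with government support and cost-effective…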

  3. RESEARCH · CL_11851 ·

    New framework uses VLM distillation for stable continual model adaptation

    Researchers have introduced Test-Time Distillation (TTD), a novel approach to address performance degradation in deep neural networks due to distribution shifts during deployment. Existing methods often suffer from pred…

  4. RESEARCH · CL_11825 ·

    Vision-language models mistake head orientation for gaze direction

    Researchers have discovered that Vision-Language Models (VLMs) struggle to accurately infer human gaze direction, often mistaking head orientation for eye movement. In a study involving 1,360 real-world images, VLMs sho…

  5. RESEARCH · CL_11793 ·

    OmniDrive-R1 enhances autonomous driving VLMs with reinforcement-driven visual grounding

    Researchers have introduced OmniDrive-R1, a novel framework for autonomous driving that integrates perception and reasoning using an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. This approach addresses ob…

  6. RESEARCH · CL_11758 ·

    OpAgent achieves 71.6% success rate in web navigation tasks

    Researchers have developed OpAgent, a novel web navigation agent that utilizes online reinforcement learning to overcome the limitations of static datasets. The agent employs a hierarchical multi-task fine-tuning approa…

  7. RESEARCH · CL_22533 ·

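    AI drafts improve audio description quality, but a quality threshold is key

    Researchers have developed methods to improve the quality and scalability of audio description (AD) generation and evaluation. One study introduces GenAD and RefineAD, a pipeline and interface that use AI-generated drafts to significantly shorten AD authoring time, provided the drafts meet a certain quality threshold. Another paper proposes a workflow that uses item response theory to assess the proficiency of human and vision-language model (VLM) raters in AD quality control, finding that top VLMs can approach human rating levels but lack human-like reasoning. A third study highlights that zero-shot VLM safety classifiers, owing to prompt-induced…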

  8. RESEARCH · CL_10251 ·

    MARVIS system uses VLM reasoning over visualizations for predictive tasks

    Researchers have developed MARVIS, a novel system that enhances the reasoning capabilities of large language and vision-language models (VLMs) by converting their latent embeddings into visual representations. This appr…

  9. RESEARCH · CL_10151 ·

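    ChartVerse framework synthesizes complex charts and reasoning data for VLMs

    Researchers have introduced ChartVerse, a new framework for generating complex charts and reliable question-answer data for vision-language models (VLMs). The system addresses the limitations of existing datasets by synthesizing diverse, high-complexity charts using a novel metric called Rollout Posterior Entropy. To ensure accuracy, ChartVerse adopts a truth-anchored inverse QA synthesis approach that extracts answers directly from source code before generating questions and verifying consistency. The resulting ChartVerse-8B model demonstrates state-of-the-art performance…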

  10. RESEARCH · CL_10145 ·

    New benchmark and framework assess VLM robustness and ethical consistency

    Researchers have developed a new benchmark, DIQ-H, to evaluate the robustness of Vision-Language Models (VLMs) under adversarial visual conditions and temporal inconsistencies. This benchmark simulates real-world stress…

  11. RESEARCH · CL_10039 ·

    WorldArena benchmark evaluates world models for functional utility beyond video generation

    Researchers from Tsinghua University have introduced WorldArena, a novel evaluation framework designed to assess the functional utility of world models, moving beyond mere visual realism. The framework addresses a criti…

  12. RESEARCH · CL_08577 ·

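    New frameworks MCM-VG and DEGround advance zero-shot 3D visual grounding research

    Researchers have developed two new frameworks, DEGround and MCM-VG, to improve ego-centric 3D visual grounding, a key task for embodied intelligence. DEGround uses a homogeneous pipeline that shares object representations between detection and grounding, improving efficiency and performance. MCM-VG tackles zero-shot 3D visual grounding by establishing multiple consistent 2D-3D mappings, which enables precise localization and reduces spatial redundancy. Both methods achieve state-of-the-art results on various benchmarks, significantly outperforming previous approaches.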

  13. RESEARCH · CL_08207 ·

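    HuM-Eval framework improves quality assessment for video generation

    Researchers have developed HuM-Eval, a new framework designed to better assess the quality of human motion in generated videos. The system follows a coarse-to-fine strategy, first using a vision-language model for a broad assessment and then performing a detailed analysis of pose and motion stability. HuM-Eval reportedly reaches a 58.2% correlation with human judgments, surpassing existing methods. The team also released HuM-Bench, a benchmark dataset of 1,000 prompts to help evaluate text-to-video models.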

  14. RESEARCH · CL_11695 ·

    New LLM techniques and benchmarks advance 3D indoor scene generation

    Researchers have developed new methods for generating 3D indoor scenes using AI, addressing challenges like spatial errors and data scarcity. One approach, SpatialGrammar, introduces a domain-specific language to repres…

  15. RESEARCH · CL_08218 ·

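    VLMs show task-dependent uncertainty in multimodal evaluation, affecting scoring reliability

    A new paper introduces conformal prediction to assess the reliability of vision-language models (VLMs) when they serve as automated judges of multimodal systems. The study shows that uncertainty in VLM evaluation depends heavily on the task: compared with image aesthetics, mathematical reasoning tasks yield markedly wider, less informative prediction intervals. The work also identifies a key issue called "rank-score decoupling", in which a VLM can rank responses accurately but cannot provide reliable absolute scores, underscoring the need for more robust evaluation methods.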

  16. RESEARCH · CL_07017 ·

    New training methods boost VLM mobile agents' interactive and safety capabilities

    Researchers have developed two new approaches for enhancing the capabilities of vision-language model (VLM)-based mobile agents. Mobile-R1 introduces a hierarchical curriculum to improve exploration and self-correction,…

  17. RESEARCH · CL_06938 ·

    Researchers unveil defenses against AR-LLM social engineering attacks

    Researchers have developed two new frameworks to combat social engineering attacks that leverage augmented reality (AR) and large language models (LLMs). The first, PhySE, uses a visual language model for rapid profile …

  18. RESEARCH · CL_06799 ·

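    AI uses hindsight to optimize financial time-series advisory

    Researchers have developed Hindsight Preference Optimization (HPO, 滞后偏好优化), a new method for training language models to provide financial time-series advisory. The technique draws on reinforcement learning principles, specifically using observed outcomes to generate preference pairs for training without human annotation. Applied to a 4-billion-parameter model handling S&P 500 stock time series, HPO outperformed its larger teacher model in both accuracy and advisory quality.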

  19. RESEARCH · CL_06646 ·

    Researchers develop multimodal QUD for deeper scientific figure comprehension

    Researchers have developed a new dataset and methodology called MQUD to enable Vision-Language Models (VLMs) to ask more insightful questions about scientific figures. This approach extends the linguistic theory of Ques…

  20. RESEARCH · CL_06562 ·

    GA2-CLIP paper introduces generic attribute anchors for VLM prompt tuning

    Researchers have developed GA2-CLIP, a novel framework designed to enhance the generalization capabilities of Vision-Language Models (VLMs) in video tasks. This plug-and-play method addresses the issue of semantic space…