ENTITY vision-language model

PulseAugur coverage of vision-language model — every cluster mentioning vision-language model across labs, papers, and developer communities, ranked by signal.

Total · 30d: 195 (195 over 90d)
Releases · 30d (0 over 90d)
Papers · 30d: 188 (188 over 90d)

TIER MIX · 90D

significant 1
research 87
tool 103
commentary 4

TOPICS

paper 188
model release 61
product 57
other 52
safety 40
infra 7

RELATIONSHIPS

instance of Vision Language Models 90%
instance of VSI-Bench 90%
instance of MLLMs 90%
used by autonomous driving 80%
instance of foundation model 70%
instance of Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond 70%
instance of multimodal large language model 70%
used by VSI-Bench 70%
used by foundation model 60%
affiliated with autonomous driving 50%

TIMELINE

2026-05-19 research_milestone A new method is proposed to improve out-of-distribution visual document understanding in VLMs.

SENTIMENT · 30D

25 day(s) with sentiment data

RECENT · PAGE 9/10 · 195 TOTAL

RESEARCH · CL_16299 · May 4 · 13:49

Coral and CoRAL systems optimize LLM serving and robotic control

Researchers have developed two distinct systems named Coral and CoRAL. Coral is an adaptive system designed for cost-efficient serving of multiple large language models across heterogeneous cloud GPUs, aiming to optimiz…
RESEARCH · CL_16304 · May 4 · 12:27

Robots gain semantic understanding with VLM and adaptive memory

Researchers have developed a "Semantic Autonomy Stack" to enable indoor mobile robots to understand natural language instructions, overcoming the latency and memory limitations of current Vision-Language Models (VLMs). …
RESEARCH · CL_14362 · May 4 · 04:00

GeoThinker framework actively integrates geometry for advanced spatial reasoning

Researchers have developed GeoThinker, a novel framework that enhances spatial reasoning in multimodal large language models (MLLMs) by actively integrating geometric information. Unlike previous passive fusion methods,…
RESEARCH · CL_21819 · May 3 · 22:46

New benchmarks tackle 'Entity Identity Confusion' in VLM knowledge editing

Researchers have identified a new failure mode in multimodal knowledge editing called Entity Identity Confusion (EIC), where edited vision-language models incorrectly associate new entity information with original image…
RESEARCH · CL_22022 · May 3 · 17:29

DexSim2Real uses foundation models to bridge sim-to-real gap in robotics

Researchers have developed DexSim2Real, a new framework that uses foundation models to improve the transfer of robotic manipulation skills from simulation to the real world. The system incorporates a vision-language mod…
RESEARCH · CL_13548 · May 3 · 08:43

AI advancements span XQuery conversion, OCR pipelines, and China's benchmark challenges

A new open-source pipeline called SGOCR 2026 has been released, designed to generate spatially-grounded OCR datasets for training vision-language models. This pipeline aims to separate text localization from semantic re…
RESEARCH · CL_11851 · May 1 · 04:00

New framework uses VLM distillation for stable continual model adaptation

Researchers have introduced Test-Time Distillation (TTD), a novel approach to address performance degradation in deep neural networks due to distribution shifts during deployment. Existing methods often suffer from pred…
RESEARCH · CL_11825 · May 1 · 04:00

Vision-language models mistake head orientation for gaze direction

Researchers have discovered that Vision-Language Models (VLMs) struggle to accurately infer human gaze direction, often mistaking head orientation for eye movement. In a study involving 1,360 real-world images, VLMs sho…
RESEARCH · CL_11793 · May 1 · 04:00

OmniDrive-R1 enhances autonomous driving VLMs with reinforcement-driven visual grounding

Researchers have introduced OmniDrive-R1, a novel framework for autonomous driving that integrates perception and reasoning using an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. This approach addresses ob…
RESEARCH · CL_11758 · May 1 · 04:00

OpAgent achieves 71.6% success rate in web navigation tasks

Researchers have developed OpAgent, a novel web navigation agent that utilizes online reinforcement learning to overcome the limitations of static datasets. The agent employs a hierarchical multi-task fine-tuning approa…
RESEARCH · CL_22533 · May 1 · 01:06

AI drafts boost audio description quality, but quality threshold is key

Researchers have developed methods to improve the quality and scalability of audio description (AD) generation and evaluation. One study introduces GenAD and RefineAD, a pipeline and an interface that use AI-generated dra…
RESEARCH · CL_10251 · Apr 30 · 04:00

MARVIS system uses VLM reasoning over visualizations for predictive tasks

Researchers have developed MARVIS, a novel system that enhances the reasoning capabilities of large language and vision-language models (VLMs) by converting their latent embeddings into visual representations. This appr…
RESEARCH · CL_10151 · Apr 30 · 04:00

ChartVerse framework synthesizes complex charts and reasoning data for VLMs

Researchers have introduced ChartVerse, a new framework designed to generate complex charts and reliable question-answering data for Vision Language Models (VLMs). This system addresses limitations in existing datasets …
RESEARCH · CL_10145 · Apr 30 · 04:00

New benchmark and framework assess VLM robustness and ethical consistency

Researchers have developed a new benchmark, DIQ-H, to evaluate the robustness of Vision-Language Models (VLMs) under adversarial visual conditions and temporal inconsistencies. This benchmark simulates real-world stress…
RESEARCH · CL_10039 · Apr 30 · 02:46

WorldArena benchmark evaluates world models for functional utility beyond video generation

Researchers from Tsinghua University have introduced WorldArena, a novel evaluation framework designed to assess the functional utility of world models, moving beyond mere visual realism. The framework addresses a criti…
RESEARCH · CL_08577 · Apr 29 · 03:38

New frameworks MCM-VG and DEGround advance zero-shot 3D visual grounding

Researchers have developed two new frameworks, DEGround and MCM-VG, to improve ego-centric 3D visual grounding, a key task for embodied intelligence. DEGround utilizes a homogeneous pipeline that shares object represent…
RESEARCH · CL_08207 · Apr 28 · 08:27

HuM-Eval framework improves video generation quality assessment

Researchers have developed HuM-Eval, a new framework designed to better evaluate the quality of human motion in generated videos. This system employs a coarse-to-fine strategy, first using a Vision Language Model for a …
RESEARCH · CL_11695 · Apr 28 · 08:25

New LLM techniques and benchmarks advance 3D indoor scene generation

Researchers have developed new methods for generating 3D indoor scenes using AI, addressing challenges like spatial errors and data scarcity. One approach, SpatialGrammar, introduces a domain-specific language to repres…
RESEARCH · CL_08218 · Apr 28 · 05:30

VLMs show task-dependent uncertainty in multimodal evaluation, impacting scoring reliability

A new paper introduces conformal prediction to assess the reliability of vision-language models (VLMs) when used as automated judges for multimodal systems. The research reveals that the uncertainty in VLM evaluations i…
RESEARCH · CL_07017 · Apr 28 · 04:00

New training methods boost VLM mobile agents' interactive and safety capabilities

Researchers have developed two new approaches for enhancing the capabilities of vision-language model (VLM)-based mobile agents. Mobile-R1 introduces a hierarchical curriculum to improve exploration and self-correction,…
