Entity: vision-language model

vision-language model

PulseAugur coverage of vision-language model — every cluster mentioning vision-language model across labs, papers, and developer communities, ranked by signal.


Total · 30 days

111

Within 90 days 111

Releases · 30 days

Within 90 days 0

Papers · 30 days

107

Within 90 days 107

Tier distribution · 90 days

significant 1
research 42
tool 66
commentary 2

Relations

instance of Vision Language Models 90%
instance of MLLMs 90%
used by VSI-Bench 70%
used by foundation model 70%
instance of foundation model 70%
instance of Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond 70%

Timeline

2026-05-19 research_milestone A new method is proposed to improve out-of-distribution visual document understanding in VLMs. Source

Sentiment · 30 days

17 days with sentiment data

Recent · Page 4 of 6 · 111 items total

RESEARCH · CL_26359 · May 11 · 10:12

GPT-5 Mini leads Agentick benchmark, but no agent paradigm dominates

The new Agentick benchmark, which assesses various AI agents across 37 tasks, shows GPT-5 Mini achieving the top score of 0.309. However, no single agent paradigm, including reinforcement learning, LLM, VLM, or hybrid a…
TOOL · CL_25598 · May 8 · 08:53

New SAEgis framework detects adversarial attacks on vision-language models

Researchers have developed a new framework called SAEgis to detect adversarial attacks on vision-language models (VLMs). This method utilizes sparse autoencoders (SAEs) as a plug-and-play module, requiring no additional…
TOOL · CL_22401 · May 8 · 04:00

ChartZero uses synthetic data to extract chart data without real-world annotation

Researchers have developed ChartZero, a novel framework designed to extract data from line charts with zero-shot capabilities. This approach bypasses the need for real-world annotations by training exclusively on synthe…
TOOL · CL_22124 · May 8 · 04:00

CompART training improves VLM multi-object grounding and visual understanding

Researchers have developed a new training method called Compositional Attention-Regularized Training (CompART) to improve how Vision-Language Models (VLMs) handle complex, multi-object references. Current VLMs struggle …
RESEARCH · CL_21791 · May 7 · 16:01

GeoStack framework enables efficient VLM knowledge composition, preventing catastrophic forgetting

Researchers have developed GeoStack, a novel framework designed to enhance knowledge composition in Vision-Language Models (VLMs). This approach addresses the issue of catastrophic forgetting, where models lose previous…
TOOL · CL_20775 · May 7 · 04:00

Consensus Entropy improves VLM OCR accuracy by measuring inter-model agreement

Researchers have developed a new metric called Consensus Entropy (CE) to assess the reliability of Optical Character Recognition (OCR) outputs from Vision-Language Models (VLMs). CE measures the agreement between multip…
TOOL · CL_20754 · May 7 · 04:00

Researchers propose new framework for generative recommendation systems

Researchers have developed a new framework to improve the generation of Semantic IDs (SIDs) for generative recommendation systems. This approach addresses issues of information and semantic degradation by integrating de…
RESEARCH · CL_20275 · May 6 · 17:33

PhysForge generates physics-grounded 3D assets for virtual worlds and embodied AI

Researchers have introduced PhysForge, a novel framework designed to generate physics-grounded 3D assets for interactive virtual worlds and embodied AI. This system addresses the limitations of existing methods by focus…
RESEARCH · CL_20307 · May 6 · 06:57

New AI models InterMesh and Anny-Fit advance 3D human pose and shape recovery

Researchers have developed InterMesh, a new framework for multi-person human mesh recovery that explicitly incorporates human-environment interaction information. This approach enhances pose and shape estimation by enri…
TOOL · CL_18874 · May 6 · 04:00

VLM pipeline enables viewpoint-agnostic grasping for robots with partial observations

Researchers have developed a new end-to-end pipeline for language-guided grasping that enhances the robustness of mobile manipulators in cluttered environments. This system uses vision-language models (VLMs) and partial…
RESEARCH · CL_18576 · May 6 · 04:00

Researchers unveil new stealthy backdoor attacks on AI models using diffusion and style features

Researchers have developed new methods for backdoor attacks on advanced AI models, specifically targeting Vision-Language Models (VLMs) and Diffusion Models (DMs). One approach, CBV, uses diffusion models to create natu…
RESEARCH · CL_18299 · May 5 · 14:08

New GLANCE framework enhances VLM agents with curiosity-driven visual-linguistic exploration

Researchers have developed a new framework called GLANCE to enhance the exploration capabilities of Vision-Language Model (VLM) agents. This framework aims to improve how these agents navigate complex and partially ob…
TOOL · CL_15782 · May 5 · 04:00

New benchmark reveals video models forget long-term context

Researchers have introduced SceneBench, a new benchmark designed to evaluate video understanding models' ability to retain context over long videos, particularly across different scenes. Their findings indicate that cur…
TOOL · CL_15622 · May 5 · 04:00

VISTA benchmark launched for advanced VLM spatio-temporal interaction analysis

Researchers have introduced VISTA, a new benchmark designed to evaluate the spatio-temporal understanding capabilities of Vision-Language Models (VLMs). Unlike existing benchmarks that focus on simple actions and limite…
TOOL · CL_15616 · May 5 · 04:00

Researchers propose Gromov-Wasserstein distance for VLM vision encoder selection

Researchers have developed a new method for selecting optimal vision encoders for Vision-Language Models (VLMs). Traditional approaches, like choosing encoders with high accuracy or large size, were found to be ineffect…
TOOL · CL_15611 · May 5 · 04:00

Chain of Evidence framework enables pixel-level visual attribution for retrieval-augmented generation

Researchers have developed a new framework called Chain of Evidence (CoE) to improve iterative retrieval-augmented generation (iRAG) systems. CoE utilizes Vision-Language Models to directly analyze screenshots of retrie…
RESEARCH · CL_16299 · May 4 · 13:49

Coral and CoRAL systems optimize LLM serving and robotic control

Researchers have developed two distinct systems named Coral and CoRAL. Coral is an adaptive system designed for cost-efficient serving of multiple large language models across heterogeneous cloud GPUs, aiming to optimiz…
RESEARCH · CL_16304 · May 4 · 12:27

Robots gain semantic understanding with VLM and adaptive memory

Researchers have developed a "Semantic Autonomy Stack" to enable indoor mobile robots to understand natural language instructions, overcoming the latency and memory limitations of current Vision-Language Models (VLMs). …
RESEARCH · CL_14362 · May 4 · 04:00

GeoThinker framework actively integrates geometry for advanced spatial reasoning

Researchers have developed GeoThinker, a novel framework that enhances spatial reasoning in multimodal large language models (MLLMs) by actively integrating geometric information. Unlike previous passive fusion methods,…
RESEARCH · CL_21819 · May 3 · 22:46

New benchmarks tackle 'Entity Identity Confusion' in LLM knowledge editing

Researchers have identified a new failure mode in multimodal knowledge editing called Entity Identity Confusion (EIC), where edited vision-language models incorrectly associate new entity information with original image…
