MLLMs
PulseAugur coverage of MLLMs — every cluster mentioning MLLMs across labs, papers, and developer communities, ranked by signal.
- instance of Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond 90%
- used by Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond 70%
- used by train of thought 70%
- used by Standard Chinese 70%
- used by Chain of Thought 70%
- used by English 60%
- 2026-05-22 · research milestone: A new pipeline was introduced to enhance MLLMs for safety-critical driving video analysis.
- 2026-05-22 · research milestone: Researchers examine how temporal grounding degrades in multimodal large language models and propose a method to recover it.
- 2026-05-22 · research milestone: A new benchmark and dataset were introduced to evaluate MLLMs' ability to reason about personality beyond superficial cues.
- 2026-05-21 · research milestone: A new MLLM-based method for detecting AI-generated Chinese poetry achieves state-of-the-art results.
-
New MSEarth benchmark uses MLLMs for Earth science discovery
Researchers have developed MSEarth, a new multimodal benchmark designed to evaluate the capabilities of multimodal large language models (MLLMs) in Earth science reasoning. This dataset comprises over 289,000 figures wi…
-
New VQA methods enhance explainability and knowledge integration for multimodal LLMs
Researchers have developed CoExVQA, a new framework for Document Visual Question Answering (DocVQA) that enhances explainability by breaking down the reasoning process. This method first identifies relevant evidence, th…
-
MLLMs show promise in analyzing seizure movements, outperforming traditional models
A pilot study explored the use of multimodal large language models (MLLMs) for analyzing pathological movements in seizure videos. The research found that MLLMs, without specific training, outperformed traditional compu…
-
New AI unlearning methods balance data removal with model utility
Researchers have developed new methods for machine unlearning, a process that removes specific data from AI models without full retraining. One approach, SHRED, uses self-distillation and logit demotion to identify and …
-
New In-Prompt Process Supervision framework enhances MLLMs for video moderation
Researchers have developed a new framework called IPS (In-Prompt Process Supervision) to enhance the accuracy of multimodal large language models (MLLMs) in content moderation for short videos. This method incorporates …
-
Researchers use RL to improve MLLM regression on imbalanced data
Researchers have developed a new framework to improve how multimodal large language models (MLLMs) handle numerical regression tasks, particularly those with imbalanced data distributions. Existing training methods ofte…
-
New HERMES and DSCache methods improve streaming video understanding with KV cache
Researchers have developed new methods to improve the efficiency of multimodal large language models (MLLMs) for understanding streaming video. One approach, HERMES, conceptualizes the KV cache as a hierarchical memory …
-
VideoThinker framework improves lightweight MLLMs' video reasoning via causal debiasing
Researchers have developed VideoThinker, a novel framework designed to enhance the reasoning capabilities of lightweight multimodal large language models (MLLMs) in video analysis. This approach addresses the issue of percept…
-
MLLMs show foundational visual gaps despite progress in multimodal reasoning
A new paper introduces a method to improve latent reasoning in multimodal large language models (MLLMs) by optimizing visual latents at inference time, addressing a failure mode in which the contribution of those visual latents is suppressed. Sepa…
-
New benchmark and models advance generalized moment retrieval in videos
Researchers have introduced Generalized Moment Retrieval (GMR), a new framework for video analysis that moves beyond the assumption of a single matching moment per query. This approach aims to retrieve all relevant temp…
-
MLLMs struggle with Chinese short-video misinformation, Gemini-2.5-Pro leads
Researchers have developed a new framework to evaluate how well Multimodal Large Language Models (MLLMs) can identify misinformation in Chinese short videos. The study utilized a dataset of 200 videos annotated for dece…
-
New AI models tackle complex chart reasoning and generation challenges
Researchers have developed new frameworks and benchmarks to improve how multimodal large language models (MLLMs) reason across complex visual data, such as charts. One approach, HierVA, uses a hierarchical agent to mana…
-
VideoDetective framework enhances long video understanding for MLLMs
Researchers have introduced VideoDetective, a novel framework designed to enhance the understanding of long videos by multimodal large language models (MLLMs). This approach addresses the challenge of limited context wi…
-
GeoThinker framework actively integrates geometry for advanced spatial reasoning
Researchers have developed GeoThinker, a novel framework that enhances spatial reasoning in multimodal large language models (MLLMs) by actively integrating geometric information. Unlike previous passive fusion methods,…
-
FreeRet framework turns multimodal LLMs into training-free retrievers
Researchers have developed FreeRet, a novel framework that enables multimodal large language models (MLLMs) to function as effective retrievers without requiring additional training. This plug-and-play system extracts s…
-
GuideDog dataset aids blind and low-vision navigation with egocentric multimodal data
Researchers have introduced GuideDog, a new dataset designed to aid the development of multimodal large language models (MLLMs) for blind and low-vision (BLV) individuals. The dataset comprises 22,000 image-description …
-
New benchmark tackles visual-semantic knowledge conflicts in surgical AI
Researchers have introduced OR-VSKC, a new benchmark designed to address visual-semantic knowledge conflicts in multimodal large language models (MLLMs) within operating room settings. The benchmark utilizes 28,190 high…
-
New AEGIS benchmark reveals AI image forensics lag behind generative advances
Researchers have introduced AEGIS, a new benchmark designed to evaluate the forensic analysis of AI-generated academic images. This benchmark addresses domain-specific complexity across seven academic categories and inc…
-
New SPUR benchmark reveals AI models struggle with scientific image interpretation
Researchers have introduced the SPUR benchmark, designed to evaluate multimodal large language models (MLLMs) on their ability to interpret scientific experimental images. SPUR includes over 4,000 question-answering pai…
-
New STAR-64K dataset and training framework boost MLLM reasoning
Researchers have developed a new method for training multimodal large language models (MLLMs) to improve their ability to reason with abstract relational knowledge presented in images. This approach involves an automat…