multimodal large language model
PulseAugur coverage of multimodal large language model — every cluster mentioning multimodal large language model across labs, papers, and developer communities, ranked by signal.
5 days with sentiment data
-
New MLLM framework unifies surgical scene understanding
Researchers have developed SurgMLLM, a novel framework that unifies surgical scene understanding by integrating high-level reasoning with low-level visual grounding. This multimodal large language model (MLLM) is fine-t…
-
AlphaGRPO framework boosts multimodal AI generation with self-reflection
Researchers have introduced AlphaGRPO, a new framework designed to improve multimodal generation in Unified Multimodal Models (UMMs). This approach uses Group Relative Policy Optimization (GRPO) to enable models to perf…
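GRPO's core idea is to replace a learned value critic with group-relative reward normalization: sample a group of responses per prompt, score them, and use each response's reward deviation from the group mean as its advantage. A minimal sketch of that advantage computation (the scores and group size here are hypothetical, not from the AlphaGRPO paper):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group Relative Policy Optimization advantage estimate:
    normalize each sampled response's reward against the group's
    mean and std, so no learned value/critic network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, a group of 4 sampled generations scored by a reward
# model (hypothetical scores):
advs = grpo_advantages([0.2, 0.8, 0.5, 0.1])
```

The normalized advantages sum to zero by construction, so within a group the policy is pushed toward above-average samples and away from below-average ones.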
-
New MPerS method uses MLLMs for remote sensing scene segmentation
Researchers have developed MPerS, a novel approach for remote sensing scene segmentation that leverages multimodal large language models (MLLMs). This method generates high-quality captions for remote sensing images usi…
-
New MLLM WeatherSyn generates weather reports, outperforms existing models
Researchers have introduced WeatherSyn, a novel instruction-tuned multimodal large language model (MLLM) designed for generating weather forecast reports. This model is trained on a new dataset, which includes data f…
-
New benchmarks and models advance video understanding reward modeling
Researchers have developed new methods for training reward models for video understanding tasks, addressing a gap in current AI capabilities. One approach introduces a benchmark called VURB and a dataset VUP-35K, leadin…
-
Motion-MLLM enhances 3D scene understanding with egomotion data
Researchers have developed Motion-MLLM, a new framework that integrates egomotion data from Inertial Measurement Units (IMUs) with video to enhance Multimodal Large Language Models (MLLMs) for 3D scene understanding. Th…
-
RemoteZero framework enables geospatial reasoning without human annotations
Researchers have introduced RemoteZero, a novel framework designed for geospatial reasoning that eliminates the need for human-annotated ground-truth coordinates. This approach leverages an MLLM's stronger ability to ve…
-
Valley3 model scales multimodal AI for global e-commerce tasks
Researchers have introduced Valley3, a new omni multimodal large language model designed for e-commerce applications. This model integrates text, image, video, and audio understanding, with a particular focus on multili…
-
New ReasonAudio benchmark reveals AI struggles with complex audio reasoning
Researchers have introduced ReasonAudio, a new benchmark designed to evaluate text-audio retrieval models on complex reasoning tasks beyond simple semantic matching. The benchmark includes 1,000 queries and 1,000 audio …
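Benchmarks like this are typically scored with retrieval recall: each text query is ranked against all audio clips by similarity, and Recall@k measures how often the correct clip appears in the top k. A minimal sketch of that metric, assuming a square similarity matrix with correct pairs on the diagonal (the scores below are made up):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Text-to-audio Recall@k on a similarity matrix where sim[i, j]
    scores text query i against audio clip j, and query i's correct
    clip sits on the diagonal (a standard benchmark setup)."""
    ranks = np.argsort(-sim, axis=1)[:, :k]   # top-k clip indices per query
    hits = sum(i in ranks[i] for i in range(sim.shape[0]))
    return hits / sim.shape[0]

sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.1, 0.7, 0.6]])
r1 = recall_at_k(sim, 1)   # only query 0 ranks its own clip first
```

Reasoning-heavy queries lower these scores precisely because surface semantic similarity no longer puts the correct clip at rank 1.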
-
New benchmarks challenge MLLMs' spatial and functional reasoning abilities
Researchers have introduced new benchmarks to evaluate the spatial and functional reasoning capabilities of multimodal large language models (MLLMs). These benchmarks aim to move beyond basic geometric perception to ass…
-
New AI methods enhance video temporal grounding with MLLMs and graph networks
Researchers have developed two new frameworks for Temporal Video Grounding (TVG), a task focused on localizing specific moments in videos based on text queries. The MASRA framework utilizes a Multimodal Large Language M…
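TVG systems output a (start, end) segment per query, and the standard evaluation compares it to the ground-truth segment by temporal IoU (e.g. R@1 at IoU ≥ 0.5). A minimal sketch of that metric, with illustrative timestamps not taken from the paper:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start_s, end_s) segments: the
    standard metric for Temporal Video Grounding evaluation."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted moment of 10-20 s against a ground truth of 15-25 s:
score = temporal_iou((10.0, 20.0), (15.0, 25.0))  # ~0.333
```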
-
New framework enables scalable video understanding with multi-agent collaboration
Researchers have introduced a Multi-Agent Collaboration Framework (MACF) designed to enhance the understanding of long videos by multi-modal large language models (MLLMs). MACF addresses the context budget limitations o…
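The context-budget problem MACF targets can be illustrated with the simplest possible partitioning scheme: split a long video's frames into contiguous chunks, each small enough for one agent to process. This is a hypothetical sketch of the idea, not the paper's actual protocol, which the summary does not specify:

```python
def chunk_frames(n_frames: int, budget: int):
    """Split a long video's frame indices into contiguous chunks,
    each fitting within one agent's context budget, so agents can
    summarize chunks in parallel before a coordinator aggregates."""
    return [list(range(s, min(s + budget, n_frames)))
            for s in range(0, n_frames, budget)]

chunks = chunk_frames(10, 4)  # -> [[0,1,2,3], [4,5,6,7], [8,9]]
```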
-
MLLM feedback on student drawings shows significant grounding failures
A new study published on arXiv reveals significant grounding failures in multimodal large language models (MLLMs) when generating feedback on student science drawings. Researchers found that 41.3% of feedback instances …
-
New VeriGround model achieves reliable circuit-to-Verilog code generation
Researchers have identified a significant reliability issue in multimodal large language models (MLLMs) when generating hardware description language (HDL) code from circuit diagrams. This "Mirage" phenomenon occurs whe…
-
OcularChat MLLM accurately diagnoses age-related macular degeneration with interactive explanations
Researchers have developed OcularChat, a multimodal large language model (MLLM) fine-tuned from Qwen2.5-VL, designed to diagnose age-related macular degeneration (AMD) using color fundus photographs. The model was train…
-
New AI methods boost industrial anomaly detection with multimodal data and LLMs
Researchers have developed three new frameworks for industrial anomaly detection using multimodal data and advanced AI techniques. One approach, EAGLE, integrates expert anomaly detectors with frozen multimodal large la…
-
Audio-Omni framework unifies audio generation, editing, and understanding
Researchers have introduced Audio-Omni, a novel framework designed to unify audio understanding, generation, and editing across diverse domains like speech, music, and general sounds. This system integrates a frozen Mul…
-
Chat-Scene++ advances 3D LLM scene understanding with context-rich object identification
Researchers have introduced Chat-Scene++, a novel framework designed to enhance multi-modal large language models (MLLMs) for 3D scene understanding. This approach structures 3D scenes as sequences of objects, incorpora…
-
MLLMs adapted for nuanced video retrieval, achieving SOTA performance
Researchers have developed a novel method for video retrieval that enhances understanding of nuanced queries. This approach adapts Multimodal Large Language Models (MLLMs) to better interpret temporal actions, negations…
-
Study rethinks token pruning for historical screenshots in GUI visual agents across semantic, spatial, and temporal perspectives
Researchers have explored token pruning strategies for GUI visual agents that utilize Multimodal Large Language Models (MLLMs). Their study revealed that background regions in screenshots, often overlooked, can provide …
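Token pruning in this setting generally means scoring each visual token's importance (for instance, by attention to the text query) and keeping only the top fraction. A minimal sketch under that assumption; the scoring function and keep ratio here are illustrative, not the study's method:

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float):
    """Keep the highest-scoring visual tokens from a screenshot.
    `scores` is a per-token importance estimate (assumed given,
    e.g. from attention to the query); spatial order is preserved."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]   # indices of the top-k tokens
    keep.sort()                      # restore original (spatial) order
    return tokens[keep], keep

toks = np.arange(8).reshape(8, 1)    # 8 dummy one-dimensional tokens
imp = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6])
kept, idx = prune_tokens(toks, imp, keep_ratio=0.5)  # keep 4 of 8
```

The study's finding that background regions carry useful signal suggests a purely foreground-biased score would discard exactly the tokens such a pruner should sometimes keep.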