multimodal large language model
PulseAugur coverage of multimodal large language model — every cluster mentioning multimodal large language model across labs, papers, and developer communities, ranked by signal.
5 days with sentiment data
-
New MLLM framework unifies surgical scene understanding
Researchers have developed SurgMLLM, a novel framework that unifies surgical scene understanding by integrating high-level reasoning with low-level visual grounding. This multimodal large language model (MLLM) is fine-t…
-
AlphaGRPO framework boosts multimodal AI generation with self-reflection
Researchers have introduced AlphaGRPO, a new framework designed to improve multimodal generation in Unified Multimodal Models (UMMs). This approach uses Group Relative Policy Optimization (GRPO) to enable models to perf…
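GRPO's core idea is to replace a learned value critic with group-relative reward normalization: sample a group of responses per prompt, score them, and use each response's reward deviation from the group mean as its advantage. A minimal sketch of that advantage computation (the scores and group size here are hypothetical, not from the AlphaGRPO paper):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group Relative Policy Optimization advantage estimate:
    normalize each sampled response's reward against the group's
    mean and std, so no learned value/critic network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, a group of 4 sampled generations scored by a reward
# model (hypothetical scores):
advs = grpo_advantages([0.2, 0.8, 0.5, 0.1])
```

The normalized advantages sum to zero by construction, so within a group the policy is pushed toward above-average samples and away from below-average ones.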
-
New MPerS method uses MLLMs for remote sensing scene segmentation
Researchers have developed MPerS, a novel approach for remote sensing scene segmentation that leverages multimodal large language models (MLLMs). This method generates high-quality captions for remote sensing images usi…
-
New MLLM WeatherSyn generates weather reports, outperforms existing models
Researchers have introduced WeatherSyn, a novel instruction-tuned multimodal large language model (MLLM) designed for generating weather forecast reports. This model is trained on a new dataset, which includes data f…
-
New benchmarks and models advance video understanding reward modeling
Researchers have developed new methods for training reward models for video understanding tasks, addressing a gap in current AI capabilities. One approach introduces a benchmark called VURB and a dataset VUP-35K, leadin…
-
Motion-MLLM enhances 3D scene understanding with egomotion data
Researchers have developed Motion-MLLM, a new framework that integrates egomotion data from Inertial Measurement Units (IMUs) with video to enhance Multimodal Large Language Models (MLLMs) for 3D scene understanding. Th…
-
RemoteZero framework enables geospatial reasoning without human annotations
Researchers have introduced RemoteZero, a novel framework designed for geospatial reasoning that eliminates the need for human-annotated ground-truth coordinates. This approach leverages an MLLM's stronger ability to ve…
-
Valley3 model scales multimodal AI for global e-commerce tasks
Researchers have introduced Valley3, a new omni multimodal large language model designed for e-commerce applications. This model integrates text, image, video, and audio understanding, with a particular focus on multili…
-
New ReasonAudio benchmark reveals AI struggles with complex audio reasoning
Researchers have introduced ReasonAudio, a new benchmark designed to evaluate text-audio retrieval models on complex reasoning tasks beyond simple semantic matching. The benchmark includes 1,000 queries and 1,000 audio …
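Benchmarks like this are typically scored with retrieval recall: each text query is ranked against all audio clips by similarity, and Recall@k measures how often the correct clip appears in the top k. A minimal sketch of that metric, assuming a square similarity matrix with correct pairs on the diagonal (the scores below are made up):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Text-to-audio Recall@k on a similarity matrix where sim[i, j]
    scores text query i against audio clip j, and query i's correct
    clip sits on the diagonal (a standard benchmark setup)."""
    ranks = np.argsort(-sim, axis=1)[:, :k]   # top-k clip indices per query
    hits = sum(i in ranks[i] for i in range(sim.shape[0]))
    return hits / sim.shape[0]

sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.4, 0.8],
                [0.1, 0.7, 0.6]])
r1 = recall_at_k(sim, 1)   # only query 0 ranks its own clip first
```

Reasoning-heavy queries lower these scores precisely because surface semantic similarity no longer puts the correct clip at rank 1.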
-
New benchmarks challenge MLLMs' spatial and functional reasoning abilities
Researchers have introduced new benchmarks to evaluate the spatial and functional reasoning capabilities of multimodal large language models (MLLMs). These benchmarks aim to move beyond basic geometric perception to ass…
-
New AI methods enhance video temporal grounding with MLLMs and graph networks
Researchers have developed two new frameworks for Temporal Video Grounding (TVG), a task focused on localizing specific moments in videos based on text queries. The MASRA framework utilizes a Multimodal Large Language M…
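TVG systems output a (start, end) segment per query, and the standard evaluation compares it to the ground-truth segment by temporal IoU (e.g. R@1 at IoU ≥ 0.5). A minimal sketch of that metric, with illustrative timestamps not taken from the paper:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start_s, end_s) segments: the
    standard metric for Temporal Video Grounding evaluation."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted moment of 10-20 s against a ground truth of 15-25 s:
score = temporal_iou((10.0, 20.0), (15.0, 25.0))  # ~0.333
```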
-
New framework enables scalable video understanding with multi-agent collaboration
Researchers have introduced a Multi-Agent Collaboration Framework (MACF) designed to enhance the understanding of long videos by multi-modal large language models (MLLMs). MACF addresses the context budget limitations o…
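The context-budget problem MACF targets can be illustrated with the simplest possible partitioning scheme: split a long video's frames into contiguous chunks, each small enough for one agent to process. This is a hypothetical sketch of the idea, not the paper's actual protocol, which the summary does not specify:

```python
def chunk_frames(n_frames: int, budget: int):
    """Split a long video's frame indices into contiguous chunks,
    each fitting within one agent's context budget, so agents can
    summarize chunks in parallel before a coordinator aggregates."""
    return [list(range(s, min(s + budget, n_frames)))
            for s in range(0, n_frames, budget)]

chunks = chunk_frames(10, 4)  # -> [[0,1,2,3], [4,5,6,7], [8,9]]
```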
-
MLLM feedback on student drawings shows significant grounding failures
A new study published on arXiv reveals significant grounding failures in multimodal large language models (MLLMs) when generating feedback on student science drawings. Researchers found that 41.3% of feedback instances …
-
New VeriGround model achieves reliable circuit-to-Verilog code generation
Researchers have identified a significant reliability issue in multimodal large language models (MLLMs) when generating hardware description language (HDL) code from circuit diagrams. This "Mirage" phenomenon occurs whe…
-
OcularChat MLLM accurately diagnoses age-related macular degeneration with interactive explanations
Researchers have developed OcularChat, a multimodal large language model (MLLM) fine-tuned from Qwen2.5-VL, designed to diagnose age-related macular degeneration (AMD) using color fundus photographs. The model was train…
-
New AI methods boost industrial anomaly detection with multimodal data and LLMs
Researchers have developed three new frameworks for industrial anomaly detection using multimodal data and advanced AI techniques. One approach, EAGLE, integrates expert anomaly detectors with frozen multimodal large la…
-
Audio-Omni framework unifies audio generation, editing, and understanding
Researchers have introduced Audio-Omni, a novel framework designed to unify audio understanding, generation, and editing across diverse domains like speech, music, and general sounds. This system integrates a frozen Mul…
-
Chat-Scene++ advances 3D LLM scene understanding with context-rich object identification
Researchers have introduced Chat-Scene++, a novel framework designed to enhance multi-modal large language models (MLLMs) for 3D scene understanding. This approach structures 3D scenes as sequences of objects, incorpora…
-
MLLMs adapted for nuanced video retrieval, achieving SOTA performance
Researchers have developed a novel method for video retrieval that enhances understanding of nuanced queries. This approach adapts Multimodal Large Language Models (MLLMs) to better interpret temporal actions, negations…
-
Study rethinks token pruning for historical screenshots in GUI visual agents across semantic, spatial, and temporal perspectives
Researchers have explored token pruning strategies for GUI visual agents that utilize Multimodal Large Language Models (MLLMs). Their study revealed that background regions in screenshots, often overlooked, can provide …
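Token pruning in this setting generally means scoring each visual token's importance (for instance, by attention to the text query) and keeping only the top fraction. A minimal sketch under that assumption; the scoring function and keep ratio here are illustrative, not the study's method:

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float):
    """Keep the highest-scoring visual tokens from a screenshot.
    `scores` is a per-token importance estimate (assumed given,
    e.g. from attention to the query); spatial order is preserved."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]   # indices of the top-k tokens
    keep.sort()                      # restore original (spatial) order
    return tokens[keep], keep

toks = np.arange(8).reshape(8, 1)    # 8 dummy one-dimensional tokens
imp = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6])
kept, idx = prune_tokens(toks, imp, keep_ratio=0.5)  # keep 4 of 8
```

The study's finding that background regions carry useful signal suggests a purely foreground-biased score would discard exactly the tokens such a pruner should sometimes keep.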