Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond
PulseAugur coverage of Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond. Every cluster mentioning this topic across labs, papers, and developer communities, ranked by signal.
- New GSEC framework uses LLMs for improved image clustering
Researchers have developed a new image clustering framework called GSEC, which utilizes generative semantic guidance and a bi-layer ensemble strategy. This approach employs Multimodal Large Language Models to create sem…
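The blurb above is truncated, so the details are assumed, but the named ingredients (MLLM-generated semantic descriptions plus a two-layer ensemble) suggest a pipeline along the lines sketched below. This is a minimal sketch, not GSEC's published algorithm; `caption_with_mllm` is a hypothetical placeholder for whatever captioning model is plugged in.

```python
# Hedged sketch of semantic-guided ensemble clustering (NOT the GSEC code;
# the bi-layer design here is an assumption based on the summary above).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def caption_with_mllm(image) -> str:
    """Hypothetical placeholder: an MLLM call returning a semantic description."""
    raise NotImplementedError("plug in your captioning model here")

def bi_layer_ensemble(visual_feats, captions, k=10, seed=0):
    # Layer 1: cluster the raw visual features.
    vis = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(visual_feats)
    # Layer 2: cluster text embeddings of the MLLM captions (semantic guidance).
    txt_feats = TfidfVectorizer().fit_transform(captions).toarray()
    txt = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(txt_feats)
    # Ensemble: average the two co-association matrices, then cluster the consensus.
    n = len(captions)
    co = np.zeros((n, n))
    for labels in (vis, txt):
        co += labels[:, None] == labels[None, :]
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(co / 2.0)
```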
- New benchmark CiteVQA exposes "Attribution Hallucination" in LLMs
Researchers have introduced CiteVQA, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to accurately attribute answers to specific source regions within documents. Unlike pre…
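The summary does not say how CiteVQA scores attribution, but a common way to check region-level grounding (an assumption here, not the paper's stated metric) is IoU matching between predicted and annotated source boxes.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def attribution_recall(pred_boxes, gold_boxes, thresh=0.5):
    """Fraction of annotated source regions matched by some predicted region."""
    hits = sum(any(box_iou(p, g) >= thresh for p in pred_boxes) for g in gold_boxes)
    return hits / len(gold_boxes) if gold_boxes else 0.0
```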
- New benchmark reveals AI models lag human experts in judging image beauty
Researchers have developed the Visual Aesthetic Benchmark (VAB) to evaluate how well multimodal large language models (MLLMs) can judge beauty in images. Their study found that current frontier MLLMs perform significant…
- New benchmark reveals MLLMs struggle with spatial reasoning
Researchers have introduced PCSR-Bench, a new diagnostic benchmark designed to evaluate the spatial reasoning capabilities of multimodal large language models (MLLMs) when processing omnidirectional images. The benchmar…
- New benchmark tests multimodal LLMs on complex optimization tasks
Researchers have introduced MM-OptBench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on optimization modeling tasks. This benchmark incorporates both text and visual information, a depa…
- New multimodal benchmark uses 900K Japanese student responses
Researchers have developed a new multimodal benchmark using data from Japan's National Assessment of Academic Ability, which includes approximately 900,000 aggregated student responses. This dataset features real exam m…
- New V-ABS framework enhances multimodal visual reasoning
Researchers have developed V-ABS, a novel beam search framework designed to improve multi-step visual reasoning in multimodal large language models. This approach addresses the imagination-action-observer bias by iterat…
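The blurb identifies V-ABS as a beam search over multi-step visual reasoning; the skeleton of such a search is below. The `expand` and `score` callables stand in for the model's step proposer and verifier, which the truncated summary does not describe.

```python
# Generic beam search over reasoning chains; V-ABS's actual proposer/scorer
# and its bias correction are not shown (details unavailable from the blurb).
from typing import Callable, List, Tuple

def beam_search(initial: str,
                expand: Callable[[str], List[str]],  # propose next reasoning steps
                score: Callable[[str], float],       # e.g., an observer/verifier model
                beam_width: int = 4,
                depth: int = 3) -> str:
    beam: List[Tuple[float, str]] = [(score(initial), initial)]
    for _ in range(depth):
        candidates = [(score(c), c) for _, chain in beam for c in expand(chain)]
        if not candidates:
            break
        # Keep only the top-scoring chains for the next reasoning step.
        beam = sorted(candidates, key=lambda t: t[0], reverse=True)[:beam_width]
    return max(beam, key=lambda t: t[0])[1]
```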
- SphereVAD uses LLM features for training-free video anomaly detection
Researchers have developed SphereVAD, a novel framework for video anomaly detection that operates without requiring any task-specific training. This method leverages the rich semantic information already present in the …
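Training-free anomaly detection with frozen features typically reduces to a distance computation in embedding space. The sketch below is an illustrative baseline under that assumption, not SphereVAD's published method.

```python
import numpy as np

def anomaly_scores(frame_embs: np.ndarray, normal_embs: np.ndarray) -> np.ndarray:
    """Cosine distance of each frame embedding to a 'normal' prototype.

    frame_embs: (n_frames, dim) features from a frozen (M)LLM vision encoder.
    normal_embs: (n_refs, dim) features of reference frames assumed normal.
    """
    proto = normal_embs.mean(axis=0)
    proto = proto / np.linalg.norm(proto)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return 1.0 - frames @ proto  # higher = more anomalous
```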
- New benchmarks and models advance video understanding reward modeling
Researchers have developed new methods for training reward models for video understanding tasks, addressing a gap in current AI capabilities. One approach introduces a benchmark called VURB and a dataset VUP-35K, leadin…
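The blurb does not say which objective these reward models use; a common choice for preference-based reward modeling is the Bradley-Terry pairwise loss, shown here purely for orientation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the reward of preferred responses above rejected ones."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```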
- Pro²Assist uses AR and LLMs for proactive procedural task guidance
Researchers have developed Pro²Assist, a new multimodal large language model system designed to offer continuous, step-aware proactive assistance for complex, long-horizon procedural tasks. Unlike previous assistants…
- VoxAfford improves 3D affordance detection with multi-scale voxel-token fusion
Researchers have developed VoxAfford, a novel method for open-vocabulary 3D affordance detection. This approach enhances multimodal large language models by integrating multi-scale geometric features from a 3D VQVAE enc…
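Multi-scale voxel-token fusion plausibly means pooling a voxel feature grid at several resolutions and flattening each grid into tokens for the language model. That reading is an assumption; the sketch below shows only the generic mechanism, not VoxAfford's architecture.

```python
import torch
import torch.nn.functional as F

def voxels_to_tokens(voxel_feats: torch.Tensor, scales=(4, 8)) -> torch.Tensor:
    """voxel_feats: (batch, C, D, H, W) features, e.g., from a 3D VQVAE encoder.

    Pools the grid to several resolutions, flattens each into scale**3 tokens,
    and concatenates along the sequence dimension for consumption by an MLLM.
    """
    tokens = []
    for s in scales:
        pooled = F.adaptive_avg_pool3d(voxel_feats, (s, s, s))  # (B, C, s, s, s)
        tokens.append(pooled.flatten(2).transpose(1, 2))        # (B, s^3, C)
    return torch.cat(tokens, dim=1)
```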
- New research tackles conflicting data in multimodal emotion recognition
Researchers have developed new methods to improve multimodal emotion recognition, which combines text, audio, and vision data. One approach, Dual-Path Conflict Resolution (DCR), learns to either fuse conflicting modalit…
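One way to learn to either fuse or suppress conflicting modalities is a learned gate over per-modality features. This module is a hedged illustration of that idea, not the DCR authors' architecture.

```python
import torch
import torch.nn as nn

class ConflictGate(nn.Module):
    """Learned soft weighting over modality features (text, audio, vision)."""

    def __init__(self, dim: int, n_modalities: int = 3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * n_modalities, n_modalities),
            nn.Softmax(dim=-1),
        )

    def forward(self, feats):  # feats: list of (batch, dim) tensors, one per modality
        stacked = torch.stack(feats, dim=1)            # (batch, M, dim)
        weights = self.gate(torch.cat(feats, dim=-1))  # (batch, M)
        # Down-weight modalities the gate judges to conflict with the others.
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)
```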
- COHERENCE benchmark evaluates MLLMs' fine-grained image-text alignment in interleaved contexts
Researchers have introduced COHERENCE, a new benchmark designed to assess the fine-grained image-text alignment capabilities of Multimodal Large Language Models (MLLMs). Existing benchmarks often overlook the complexiti…
- Researchers develop new methods for knowledge graph retrieval and completion
Researchers have developed new frameworks to enhance knowledge graph completion and visual question answering by integrating multimodal knowledge graphs with retrieval-augmented generation techniques. One approach, RADD…
- OmniVTG dataset and CoT paradigm enhance open-world video temporal grounding
Researchers have introduced OmniVTG, a large-scale dataset and training paradigm designed to improve open-world Video Temporal Grounding (VTG) for Multimodal Large Language Models (MLLMs). The dataset was created using …
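Video temporal grounding is conventionally scored with temporal IoU between predicted and annotated spans; whether OmniVTG uses exactly this metric is not stated in the truncated summary.

```python
def temporal_iou(pred: tuple, gold: tuple) -> float:
    """tIoU of two (start, end) spans in seconds, the standard VTG metric."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0
```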
- OS-SPEAR toolkit evaluates AI agents for safety, performance, efficiency, and robustness
Researchers have introduced OS-SPEAR, a new toolkit designed to rigorously evaluate operating system agents. This toolkit assesses agents across four key dimensions: safety, performance, efficiency, and robustness. OS-S…
- MLLMs predict mouse social dominance in novel MTT-Bench benchmark
Researchers have developed MTT-Bench, a new benchmark for analyzing mouse social dominance using Multimodal Large Language Models (MLLMs). This framework fine-tunes existing MLLM architectures to predict dominance hiera…
- SAKE framework enhances multimodal NER with self-aware knowledge exploitation
Researchers have developed SAKE, a new framework designed to improve Grounded Multimodal Named Entity Recognition (GMNER). SAKE addresses challenges in open-world environments, such as identifying long-tailed and evolvi…
- Air-Know network tackles composed image retrieval with novel expert-proxy-diversion paradigm
Researchers have introduced Air-Know, a novel network designed to tackle the Composed Image Retrieval (CIR) challenge, specifically addressing the Noisy Triplet Correspondence (NTC) problem. Existing methods struggle wi…