Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond
PulseAugur coverage of Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond. Every cluster mentioning this topic across labs, papers, and developer communities, ranked by signal.
- New GSEC framework uses LLMs for improved image clustering
Researchers have developed a new image clustering framework called GSEC, which utilizes generative semantic guidance and a bi-layer ensemble strategy. This approach employs Multimodal Large Language Models to create sem…
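The blurb above is truncated, so the details are assumed, but the named ingredients (MLLM-generated semantic descriptions plus a two-layer ensemble) suggest a pipeline along the lines sketched below. This is a minimal sketch, not GSEC's published algorithm; `caption_with_mllm` is a hypothetical placeholder for whatever captioning model is plugged in.

```python
# Hedged sketch of semantic-guided ensemble clustering (NOT the GSEC code;
# the bi-layer design here is an assumption based on the summary above).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def caption_with_mllm(image) -> str:
    """Hypothetical placeholder: an MLLM call returning a semantic description."""
    raise NotImplementedError("plug in your captioning model here")

def bi_layer_ensemble(visual_feats, captions, k=10, seed=0):
    # Layer 1: cluster the raw visual features.
    vis = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(visual_feats)
    # Layer 2: cluster text embeddings of the MLLM captions (semantic guidance).
    txt_feats = TfidfVectorizer().fit_transform(captions).toarray()
    txt = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(txt_feats)
    # Ensemble: average the two co-association matrices, then cluster the consensus.
    n = len(captions)
    co = np.zeros((n, n))
    for labels in (vis, txt):
        co += labels[:, None] == labels[None, :]
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(co / 2.0)
```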
- New benchmark CiteVQA exposes "Attribution Hallucination" in LLMs
Researchers have introduced CiteVQA, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to accurately attribute answers to specific source regions within documents. Unlike pre…
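The summary does not say how CiteVQA scores attribution, but a common way to check region-level grounding (an assumption here, not the paper's stated metric) is IoU matching between predicted and annotated source boxes.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def attribution_recall(pred_boxes, gold_boxes, thresh=0.5):
    """Fraction of annotated source regions matched by some predicted region."""
    hits = sum(any(box_iou(p, g) >= thresh for p in pred_boxes) for g in gold_boxes)
    return hits / len(gold_boxes) if gold_boxes else 0.0
```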
- New benchmark reveals AI models lag human experts in judging image beauty
Researchers have developed the Visual Aesthetic Benchmark (VAB) to evaluate how well multimodal large language models (MLLMs) can judge beauty in images. Their study found that current frontier MLLMs perform significant…
- New benchmark reveals MLLMs struggle with spatial reasoning
Researchers have introduced PCSR-Bench, a new diagnostic benchmark designed to evaluate the spatial reasoning capabilities of multimodal large language models (MLLMs) when processing omnidirectional images. The benchmar…
- New benchmark tests multimodal LLMs on complex optimization tasks
Researchers have introduced MM-OptBench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on optimization modeling tasks. This benchmark incorporates both text and visual information, a depa…
- New multimodal benchmark uses 900K Japanese student responses
Researchers have developed a new multimodal benchmark using data from Japan's National Assessment of Academic Ability, which includes approximately 900,000 aggregated student responses. This dataset features real exam m…
- New V-ABS framework enhances multimodal visual reasoning
Researchers have developed V-ABS, a novel beam search framework designed to improve multi-step visual reasoning in multimodal large language models. This approach addresses the imagination-action-observer bias by iterat…
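The blurb identifies V-ABS as a beam search over multi-step visual reasoning; the skeleton of such a search is below. The `expand` and `score` callables stand in for the model's step proposer and verifier, which the truncated summary does not describe.

```python
# Generic beam search over reasoning chains; V-ABS's actual proposer/scorer
# and its bias correction are not shown (details unavailable from the blurb).
from typing import Callable, List, Tuple

def beam_search(initial: str,
                expand: Callable[[str], List[str]],  # propose next reasoning steps
                score: Callable[[str], float],       # e.g., an observer/verifier model
                beam_width: int = 4,
                depth: int = 3) -> str:
    beam: List[Tuple[float, str]] = [(score(initial), initial)]
    for _ in range(depth):
        candidates = [(score(c), c) for _, chain in beam for c in expand(chain)]
        if not candidates:
            break
        # Keep only the top-scoring chains for the next reasoning step.
        beam = sorted(candidates, key=lambda t: t[0], reverse=True)[:beam_width]
    return max(beam, key=lambda t: t[0])[1]
```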
- SphereVAD uses LLM features for training-free video anomaly detection
Researchers have developed SphereVAD, a novel framework for video anomaly detection that operates without requiring any task-specific training. This method leverages the rich semantic information already present in the …
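Training-free anomaly detection with frozen features typically reduces to a distance computation in embedding space. The sketch below is an illustrative baseline under that assumption, not SphereVAD's published method.

```python
import numpy as np

def anomaly_scores(frame_embs: np.ndarray, normal_embs: np.ndarray) -> np.ndarray:
    """Cosine distance of each frame embedding to a 'normal' prototype.

    frame_embs: (n_frames, dim) features from a frozen (M)LLM vision encoder.
    normal_embs: (n_refs, dim) features of reference frames assumed normal.
    """
    proto = normal_embs.mean(axis=0)
    proto = proto / np.linalg.norm(proto)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return 1.0 - frames @ proto  # higher = more anomalous
```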
- New benchmarks and models advance video understanding reward modeling
Researchers have developed new methods for training reward models for video understanding tasks, addressing a gap in current AI capabilities. One approach introduces a benchmark called VURB and a dataset VUP-35K, leadin…
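The blurb does not say which objective these reward models use; a common choice for preference-based reward modeling is the Bradley-Terry pairwise loss, shown here purely for orientation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the reward of preferred responses above rejected ones."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```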
- Pro²Assist uses AR and LLMs for proactive procedural task guidance
Researchers have developed Pro²Assist, a new multimodal large language model system designed to offer continuous, step-aware proactive assistance for complex, long-horizon procedural tasks. Unlike previous assistants…
- VoxAfford improves 3D affordance detection with multi-scale voxel-token fusion
Researchers have developed VoxAfford, a novel method for open-vocabulary 3D affordance detection. This approach enhances multimodal large language models by integrating multi-scale geometric features from a 3D VQVAE enc…
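Multi-scale voxel-token fusion plausibly means pooling a voxel feature grid at several resolutions and flattening each grid into tokens for the language model. That reading is an assumption; the sketch below shows only the generic mechanism, not VoxAfford's architecture.

```python
import torch
import torch.nn.functional as F

def voxels_to_tokens(voxel_feats: torch.Tensor, scales=(4, 8)) -> torch.Tensor:
    """voxel_feats: (batch, C, D, H, W) features, e.g., from a 3D VQVAE encoder.

    Pools the grid to several resolutions, flattens each into scale**3 tokens,
    and concatenates along the sequence dimension for consumption by an MLLM.
    """
    tokens = []
    for s in scales:
        pooled = F.adaptive_avg_pool3d(voxel_feats, (s, s, s))  # (B, C, s, s, s)
        tokens.append(pooled.flatten(2).transpose(1, 2))        # (B, s^3, C)
    return torch.cat(tokens, dim=1)
```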
- New research tackles conflicting data in multimodal emotion recognition
Researchers have developed new methods to improve multimodal emotion recognition, which combines text, audio, and vision data. One approach, Dual-Path Conflict Resolution (DCR), learns to either fuse conflicting modalit…
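One way to learn to either fuse or suppress conflicting modalities is a learned gate over per-modality features. This module is a hedged illustration of that idea, not the DCR authors' architecture.

```python
import torch
import torch.nn as nn

class ConflictGate(nn.Module):
    """Learned soft weighting over modality features (text, audio, vision)."""

    def __init__(self, dim: int, n_modalities: int = 3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * n_modalities, n_modalities),
            nn.Softmax(dim=-1),
        )

    def forward(self, feats):  # feats: list of (batch, dim) tensors, one per modality
        stacked = torch.stack(feats, dim=1)            # (batch, M, dim)
        weights = self.gate(torch.cat(feats, dim=-1))  # (batch, M)
        # Down-weight modalities the gate judges to conflict with the others.
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)
```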
- COHERENCE benchmark evaluates MLLMs' fine-grained image-text alignment in interleaved contexts
Researchers have introduced COHERENCE, a new benchmark designed to assess the fine-grained image-text alignment capabilities of Multimodal Large Language Models (MLLMs). Existing benchmarks often overlook the complexiti…
- Researchers develop new methods for knowledge graph retrieval and completion
Researchers have developed new frameworks to enhance knowledge graph completion and visual question answering by integrating multimodal knowledge graphs with retrieval-augmented generation techniques. One approach, RADD…
- OmniVTG dataset and CoT paradigm enhance open-world video temporal grounding
Researchers have introduced OmniVTG, a large-scale dataset and training paradigm designed to improve open-world Video Temporal Grounding (VTG) for Multimodal Large Language Models (MLLMs). The dataset was created using …
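Video temporal grounding is conventionally scored with temporal IoU between predicted and annotated spans; whether OmniVTG uses exactly this metric is not stated in the truncated summary.

```python
def temporal_iou(pred: tuple, gold: tuple) -> float:
    """tIoU of two (start, end) spans in seconds, the standard VTG metric."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0
```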
- OS-SPEAR toolkit evaluates AI agents for safety, performance, efficiency, and robustness
Researchers have introduced OS-SPEAR, a new toolkit designed to rigorously evaluate operating system agents. This toolkit assesses agents across four key dimensions: safety, performance, efficiency, and robustness. OS-S…
- MLLMs predict mouse social dominance in novel MTT-Bench benchmark
Researchers have developed MTT-Bench, a new benchmark for analyzing mouse social dominance using Multimodal Large Language Models (MLLMs). This framework fine-tunes existing MLLM architectures to predict dominance hiera…
- SAKE framework enhances multimodal NER with self-aware knowledge exploitation
Researchers have developed SAKE, a new framework designed to improve Grounded Multimodal Named Entity Recognition (GMNER). SAKE addresses challenges in open-world environments, such as identifying long-tailed and evolvi…
- Air-Know network tackles composed image retrieval with novel expert-proxy-diversion paradigm
Researchers have introduced Air-Know, a novel network designed to tackle the Composed Image Retrieval (CIR) challenge, specifically addressing the Noisy Triplet Correspondence (NTC) problem. Existing methods struggle wi…