MLLMs
PulseAugur coverage of MLLMs — every cluster mentioning MLLMs across labs, papers, and developer communities, ranked by signal.
- 2026-05-22 research_milestone A new pipeline was introduced to enhance MLLMs for safety-critical driving video analysis. Source
- 2026-05-22 research_milestone Researchers reveal and propose a method to recover temporal grounding in multimodal large language models. Source
- 2026-05-22 research_milestone A new benchmark and dataset were introduced to evaluate MLLMs' ability to reason about personality beyond superficial cues. Source
- 2026-05-21 research_milestone A new method using MLLMs for detecting AI-generated Chinese poetry achieves state-of-the-art results. Source
9 days with sentiment data
-
MLLMs struggle with Chinese short-video misinformation, Gemini-2.5-Pro leads
Researchers have developed a new framework to evaluate how well Multimodal Large Language Models (MLLMs) can identify misinformation in Chinese short videos. The study utilized a dataset of 200 videos annotated for dece…
-
New AI models tackle complex chart reasoning and generation challenges
Researchers have developed new frameworks and benchmarks to improve how multimodal large language models (MLLMs) reason across complex visual data, such as charts. One approach, HierVA, uses a hierarchical agent to mana…
-
VideoDetective framework enhances long video understanding for MLLMs
Researchers have introduced VideoDetective, a novel framework designed to enhance the understanding of long videos by multimodal large language models (MLLMs). This approach addresses the challenge of limited context wi…
-
GeoThinker framework actively integrates geometry for advanced spatial reasoning
Researchers have developed GeoThinker, a novel framework that enhances spatial reasoning in multimodal large language models (MLLMs) by actively integrating geometric information. Unlike previous passive fusion methods,…
-
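FreeRet framework turns multimodal large language models into training-free retrievers
Researchers have developed FreeRet, a novel framework that enables multimodal large language models (MLLMs) to serve effectively as retrievers without additional training. This plug-and-play system extracts semantically grounded embeddings from off-the-shelf MLLMs for initial candidate search, then uses their reasoning ability for precise reranking. On the MMEB and MMEB-V2 benchmarks, FreeRet shows significant performance gains over models trained on millions of pairs, demonstrating its potential to unify retrieval, reranking, and generation within a single model.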
-
GuideDog dataset aids blind and low-vision navigation with egocentric multimodal data
Researchers have introduced GuideDog, a new dataset designed to aid the development of multimodal large language models (MLLMs) for blind and low-vision (BLV) individuals. The dataset comprises 22,000 image-description …
-
New benchmark tackles visual-semantic knowledge conflicts in surgical AI
Researchers have introduced OR-VSKC, a new benchmark designed to address visual-semantic knowledge conflicts in multimodal large language models (MLLMs) within operating room settings. The benchmark utilizes 28,190 high…
-
New AEGIS benchmark reveals AI image forensics lag behind generative advances
Researchers have introduced AEGIS, a new benchmark designed to evaluate the forensic analysis of AI-generated academic images. This benchmark addresses domain-specific complexity across seven academic categories and inc…
-
New SPUR benchmark reveals AI models struggle with scientific image interpretation
Researchers have introduced the SPUR benchmark, designed to evaluate multimodal large language models (MLLMs) on their ability to interpret scientific experimental images. SPUR includes over 4,000 question-answering pai…
-
New STAR-64K dataset and training framework boost MLLM reasoning
Researchers have developed a new method for training multi-modal large language models (MLLMs) to improve their ability to reason with abstract relational knowledge presented in images. This approach involves an automat…
-
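ReGATE method accelerates multimodal LLM training through selective token pruning
Researchers have developed ReGATE, a novel method that accelerates the training of multimodal large language models (MLLMs) by adaptively pruning tokens. The technique uses a teacher-student framework in which a frozen teacher model guides the student to identify and discard redundant tokens during training. ReGATE has been shown to be up to twice as fast as standard methods on benchmarks such as MVBench, while significantly reducing the number of tokens processed and still reaching peak accuracy.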
-
COHERENCE benchmark evaluates MLLMs' fine-grained image-text alignment in interleaved contexts
Researchers have introduced COHERENCE, a new benchmark designed to assess the fine-grained image-text alignment capabilities of Multimodal Large Language Models (MLLMs). Existing benchmarks often overlook the complexiti…
-
New framework improves MLLMs' accuracy in dial-based measurement reading
Researchers have identified a significant weakness in multimodal large language models (MLLMs) when it comes to reading dial-based measurements. These models struggle with accuracy and are highly sensitive to changes in…
-
SIEVES method boosts multimodal LLM coverage on visual tasks with evidence scoring
Researchers have developed SIEVES, a novel method for improving the reliability of multimodal large language models (MLLMs) in out-of-distribution scenarios. SIEVES works by learning to estimate the quality of visual ev…
-
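CrossGuard protects multimodal large language models from implicit and explicit attacks
Researchers have developed CrossGuard, a new defense system designed to protect multimodal large language models (MLLMs) from sophisticated implicit attacks. These attacks combine seemingly harmless text and image inputs to convey malicious intent, which makes them hard to detect. To address this, the team also created ImpForge, an automated pipeline that generates diverse implicit attack samples for training and evaluation. Experiments show that, compared with existing defenses, CrossGuard provides superior protection against both implicit and explicit threats while preserving model utility.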
-
MLLMs tested on reconstructing masked text from visual context with MMTR-Bench
Researchers have developed MMTR-Bench, a new benchmark designed to test the ability of Multimodal Large Language Models (MLLMs) to reconstruct missing text solely from visual context. This benchmark avoids explicit prom…
-
AI system SoccerRef-Agents uses multi-agent reasoning for soccer refereeing
Researchers have introduced SoccerRef-Agents, a multi-agent system designed to automate soccer refereeing with enhanced accuracy and explainability. The framework incorporates a new benchmark dataset, SoccerRefBench, fe…
-
New methods enhance LLMs for fine-grained visual recognition tasks
Two new research papers propose novel methods for improving Fine-Grained Visual Recognition (FGVR) using Large Vision-Language Models (LVLMs). The first paper introduces SARE, a framework that adaptively applies reasoni…
-
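OmniVTG dataset and CoT paradigm enhance open-world video temporal grounding
Researchers have introduced OmniVTG, a large-scale dataset and training paradigm aimed at improving open-world video temporal grounding (VTG) in multimodal large language models (MLLMs). The dataset uses a novel pipeline to identify and collect videos containing underrepresented concepts, and a caption-centric strategy for high-quality annotation. The work also proposes a self-correcting chain-of-thought (CoT) training method that uses the MLLMs' understanding ability to refine predictions, achieving state-of-the-art performance on existing benchmarks and on the new OmniVTG dataset.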
-
New benchmark reveals AI models struggle with ego-motion understanding in driving
Researchers have developed EgoDyn-Bench, a new benchmark designed to evaluate how well vision-centric foundation models understand ego-motion in autonomous driving scenarios. The benchmark reveals a significant 'Percept…