Large Multimodal Models
PulseAugur coverage of Large Multimodal Models — every cluster mentioning Large Multimodal Models across labs, papers, and developer communities, ranked by signal.
5 days with sentiment data
-
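LongVT framework enhances AI video reasoning through tool calling
Researchers have developed LongVT, a new framework designed to improve how large multimodal models (LMMs) process and reason over long videos. The method mimics human comprehension by first skimming the entire video and then focusing on specific clips for detail, using the LMM's native temporal grounding ability as a tool to zoom in on relevant clips. To support this, a new dataset called VideoSIAH has been curated, containing more than 247,000 samples for supervised fine-tuning plus additional data for reinforcement learning, along with a benchmark of 1,280 question-answer pairs. LongVT, on several challenging long-video understanding…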
-
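AWS Strands Evals adds multimodal judges for image-to-text tasks
Amazon Web Services has introduced new multimodal evaluators for its Strands Evals SDK, designed to assess image-to-text tasks. These tools use large multimodal models (MLMM) to judge responses by directly referencing the source image, addressing the limitations of text-only evaluation methods. The evaluators can identify visual hallucinations and factual errors, and they integrate into existing development workflows for automated quality control.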
-
AI research tackles temporal grounding for AVs and video analysis
Two new research papers explore methods to improve temporal grounding in AI systems, particularly for autonomous vehicles and video analysis. The first paper, "From Prompts to Pavement Through Time," investigates tempor…
-
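New AQuaUI method sharply reduces visual tokens for GUI agents
Researchers have developed AQuaUI, a novel method for reducing the number of visual tokens that large multimodal models (LMMs) process when interacting with graphical user interfaces (GUIs). This training-free technique builds an adaptive quadtree over GUI screenshots, representing regions of low information density with a single token while preserving spatial relationships. AQuaUI also includes a conditional algorithm that uses consecutive screenshots to maintain temporal consistency, improving the accuracy-efficiency trade-off of GUI agent models.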
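To make the adaptive-quadtree idea in the summary above concrete, here is a minimal sketch, not AQuaUI's actual implementation. It assumes a grayscale screenshot stored as nested lists and uses the intensity range of a region as a stand-in for "information density"; that measure, the `threshold` and `min_size` parameters, and all names below are hypothetical. The conditional step over consecutive screenshots is not covered.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[float]]  # grayscale screenshot: rows of pixel intensities


@dataclass(frozen=True)
class Region:
    """Axis-aligned block; keeping x/y/w/h preserves the spatial layout."""
    x: int
    y: int
    w: int
    h: int


def intensity_range(img: Image, r: Region) -> float:
    # Hypothetical density proxy (an assumption): max minus min intensity.
    vals = [img[row][col]
            for row in range(r.y, r.y + r.h)
            for col in range(r.x, r.x + r.w)]
    return max(vals) - min(vals)


def build_quadtree(img: Image, region: Region,
                   threshold: float, min_size: int = 1) -> List[Region]:
    """Return leaf regions; each leaf would stand for one visual token."""
    too_small = region.w <= min_size or region.h <= min_size
    if too_small or intensity_range(img, region) <= threshold:
        return [region]  # low-density (or minimal) block: keep as one leaf
    hw, hh = region.w // 2, region.h // 2
    children = [
        Region(region.x, region.y, hw, hh),
        Region(region.x + hw, region.y, region.w - hw, hh),
        Region(region.x, region.y + hh, hw, region.h - hh),
        Region(region.x + hw, region.y + hh, region.w - hw, region.h - hh),
    ]
    leaves: List[Region] = []
    for child in children:
        leaves.extend(build_quadtree(img, child, threshold, min_size))
    return leaves


# Toy 8x8 "screenshot": blank except for one bright pixel in the corner.
screenshot = [[0.0] * 8 for _ in range(8)]
screenshot[0][0] = 1.0
leaves = build_quadtree(screenshot, Region(0, 0, 8, 8), threshold=0.0)
print(len(leaves))  # 10 leaves instead of 64 single-pixel patches
```

In this toy case only the quadrant containing the bright pixel is subdivided down to single pixels, while the blank quadrants stay as large leaves, which is the source of the token savings.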
-
New benchmarks and synthetic data aim to boost AI's egocentric video understanding
Researchers have introduced new benchmarks and synthetic data generation methods to improve the performance of large multimodal models (LMMs) on egocentric video data. The EgoBabyVLM benchmark focuses on language ground…
-
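New research explores synergy between visual understanding and generation in multimodal models
Researchers are exploring new ways to improve unified multimodal models (UMMs) by strengthening the synergy between visual understanding and generation. One approach, Semantic Generation Tuning (SGT), uses image segmentation as a generative proxy to align these capabilities, showing improved performance on both understanding and generation tasks. Another model, Lance, uses collaborative multi-task training with a dual-stream architecture to achieve a similar goal, outperforming existing open-source models in image and video generation. A third paper introduces Generation-to-Understanding (G2U) synergy, in which generative actions such as detail enhancement are used as intermediate reasoning steps that, without retraining, refine…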
-
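New FIKA-Bench tests AI knowledge acquisition beyond visual recognition
Researchers have introduced FIKA-Bench, a new benchmark designed to evaluate the ability of AI systems to acquire knowledge about unknown objects, going beyond simple visual recognition. The benchmark contains 311 carefully curated real-world instances to avoid data leakage and ensure evidence grounding. Evaluations show that even state-of-the-art large multimodal models and agents perform poorly on the task, with accuracy of only about 25%, highlighting the need for improved agent designs focused on fine-grained recognition and evidence verification.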
-
New benchmarks reveal major gaps in multimodal context learning for LLMs
Two new benchmarks, MMCL-Bench and Personal-VCL-Bench, have been introduced to evaluate the multimodal context learning capabilities of large language models. MMCL-Bench focuses on learning from visual rules, procedures…
-
New method enhances LMM spatial reasoning with generated viewpoints
Researchers have introduced a new paradigm called Thinking with Novel Views (TwNV) to enhance the spatial reasoning capabilities of Large Multimodal Models (LMMs). This approach integrates generative novel-view synthesi…
-
New LithoBench benchmark reveals large multimodal model limitations
Researchers have introduced LithoBench, a new benchmark designed to evaluate the capabilities of large multimodal models in interpreting geological lithology from remote sensing data. This benchmark includes 10,000 expe…
-
New CC-OCR V2 benchmark reveals LMMs fall short in real-world document processing
A new benchmark, CC-OCR V2, has been released to evaluate Large Multimodal Models (LMMs) on real-world document processing tasks. The benchmark includes 7,093 challenging samples across five OCR-centric tracks, addressi…
-
New CSteer method guides large multimodal models to refer multiple regions without fine-tuning
Researchers have developed a new training-free method called Contextual Latent Steering (CSteer) to enhance the ability of Large Multimodal Models (LMMs) to accurately identify and refer to multiple specific regions wit…
-
VEBench benchmark evaluates large multimodal models for video editing tasks
Researchers have introduced VEBENCH, a new benchmark designed to evaluate Large Multimodal Models (LMMs) in real-world video editing tasks. The benchmark includes over 3.9K edited videos and 3,080 question-answer pairs,…
-
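Tree-of-Evidence algorithm improves interpretability of multimodal AI
Researchers have developed a new method called Tree-of-Evidence (ToE) to improve the interpretability of large multimodal models (LMMs). ToE frames model interpretability as an optimization problem, using a lightweight "evidence bottleneck" to identify the key data units behind a prediction. The approach allows auditable evidence tracing while maintaining high predictive performance, retaining more than 98% of the full model's AUROC with only a minimal number of evidence units.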
-
Researchers develop Glance-or-Gaze to improve LMM visual search with adaptive focus
Researchers have introduced Glance-or-Gaze (GoG), a new framework designed to improve Large Multimodal Models (LMMs) in handling knowledge-intensive visual queries. Unlike previous methods that retrieve information indi…
-
New benchmark UNIKIE-BENCH evaluates large multimodal models for document information extraction
Researchers have introduced UNIKIE-BENCH, a new benchmark designed to systematically evaluate the performance of Large Multimodal Models (LMMs) in extracting key information from visual documents. The benchmark features…