Brief

last 24h

[10/10] 224 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
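
One way to picture the 0–100 score described above is a weighted sum of the three quality signals multiplied by an exponential time-decay factor. The sketch below is only an illustration: the weights, the 24-hour half-life, and the function name `score_story` are assumptions, not the feed's published formula.

```python
# Hypothetical weights for the three quality signals (assumed, sum to 1).
WEIGHTS = {"authority": 0.4, "cluster_strength": 0.3, "headline_signal": 0.3}
HALF_LIFE_HOURS = 24.0  # assumed: the score halves every 24 hours


def _clamp01(x):
    """Keep a signal inside the [0, 1] range."""
    return max(0.0, min(1.0, x))


def score_story(authority, cluster_strength, headline_signal, age_hours):
    """Combine three signals in [0, 1] with time decay; returns a 0-100 score."""
    base = (
        WEIGHTS["authority"] * _clamp01(authority)
        + WEIGHTS["cluster_strength"] * _clamp01(cluster_strength)
        + WEIGHTS["headline_signal"] * _clamp01(headline_signal)
    )
    # Exponential decay: 1.0 for a fresh story, 0.5 after one half-life.
    decay = 0.5 ** (max(0.0, age_hours) / HALF_LIFE_HOURS)
    return round(100 * base * decay, 1)
```

Under these assumed settings, a story with perfect signals scores 100 when fresh and 50 after one half-life.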

TOOL · arXiv cs.CV English(EN) · 1mo

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

Researchers have introduced FIKA-Bench, a benchmark that evaluates whether AI systems can acquire knowledge about unfamiliar objects rather than merely recognize them. The benchmark consists of 311 real-life instances curated to avoid leakage and ensure evidence grounding. Even state-of-the-art large multimodal models and agents reach only around 25% accuracy on this task, which points to a need for agent designs focused on fine-grained recognition and evidence verification.

IMPACT Introduces a benchmark to push AI beyond recognition towards active knowledge acquisition, potentially improving real-world object understanding.
RESEARCH · arXiv cs.CV English(EN) · 1mo · [2 sources]

Personal Visual Context Learning in Large Multimodal Models

Two new benchmarks, MMCL-Bench and Personal-VCL-Bench, have been introduced to evaluate the multimodal context learning capabilities of large multimodal models. MMCL-Bench focuses on learning from visual rules, procedures, and evidence, while Personal-VCL-Bench assesses whether models can use user-specific visual context to answer personalized queries. Both benchmarks reveal significant limitations in current frontier multimodal models, indicating a substantial gap in their ability to extract, reason over, and apply visual information.

IMPACT Highlights a critical bottleneck in current multimodal models, suggesting future research directions for personalized AI assistants.
TOOL · arXiv cs.CV English(EN) · 1mo

Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

Researchers have introduced a paradigm called Thinking with Novel Views (TwNV) to enhance the spatial reasoning capabilities of Large Multimodal Models (LMMs). The approach integrates generative novel-view synthesis into the LMM's reasoning process, allowing the model to generate and analyze alternative viewpoints when it faces spatial ambiguity. Experiments showed that precise camera-pose specifications are more effective than natural language for view control, and that the quality of the synthesized views directly affects spatial accuracy. TwNV consistently improved accuracy across various LMM architectures and spatial reasoning tasks.

IMPACT Enhances LMMs' ability to understand spatial relationships, potentially improving applications in robotics and scene understanding.
TOOL · arXiv cs.CV English(EN) · 1mo

LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation

Researchers have introduced LithoBench, a benchmark designed to evaluate how well large multimodal models interpret geological lithology from remote sensing data. The benchmark includes 10,000 expert-annotated instances across 12 lithological categories, structured into five cognitive levels from basic identification to complex reasoning. Experiments on LithoBench revealed significant limitations in current large multimodal models, particularly on higher-order geological explanation, application, and reasoning tasks.

IMPACT This benchmark will help researchers identify and address the shortcomings of large multimodal models in specialized domains like geology.
RESEARCH · arXiv cs.CL English(EN) · 1mo · [2 sources]

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

A new benchmark, CC-OCR V2, has been released to evaluate Large Multimodal Models (LMMs) on real-world document processing tasks. The benchmark includes 7,093 challenging samples across five OCR-centric tracks, addressing limitations of existing benchmarks that do not reflect practical application conditions. Experiments with 14 advanced LMMs showed significant performance degradation, highlighting a gap between current model capabilities and real-world requirements.

IMPACT Highlights a gap in LMM performance for real-world document processing, suggesting current models may not meet enterprise needs.
TOOL · arXiv cs.CV English(EN) · 1mo

Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering

Researchers have developed a training-free method called Contextual Latent Steering (CSteer) that improves the ability of Large Multimodal Models (LMMs) to identify and refer to multiple specific regions within an image. The approach modifies the model's internal representations during inference, so the model can better differentiate between regions and take global context into account without additional fine-tuning or architectural changes. Experiments on various datasets show that LMMs equipped with CSteer surpass specialized referring models, establishing a new state of the art in visual referring tasks.

IMPACT Enhances visual referring capabilities of LMMs, potentially improving applications in image analysis and multimodal AI research.
RESEARCH · arXiv cs.CV English(EN) · 1mo · [2 sources]

VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

Researchers have introduced VEBench, a benchmark designed to evaluate Large Multimodal Models (LMMs) on real-world video editing tasks. The benchmark includes over 3.9K edited videos and 3,080 question-answer pairs, focusing on recognizing editing techniques and simulating editing workflows. Experiments on VEBench revealed a significant performance gap between current LMMs and humans in video editing, highlighting the need for improved multimodal reasoning and operational capabilities.

IMPACT Establishes a new standard for evaluating AI in video editing, potentially guiding future development of more capable creative AI tools.
RESEARCH · arXiv cs.LG English(EN) · 1mo

Tree-of-Evidence: Efficient "System 2" Search for Faithful Multimodal Grounding

Researchers have developed a method called Tree-of-Evidence (ToE) to improve the interpretability of Large Multimodal Models (LMMs). ToE frames model interpretability as an optimization problem, using lightweight "Evidence Bottlenecks" to identify the data units that are crucial for a prediction. The approach produces auditable evidence traces while maintaining high predictive performance, retaining over 98% of the full model's AUROC with minimal evidence units.

IMPACT Provides a practical mechanism for auditing multimodal models by revealing discrete evidence units that support predictions.
RESEARCH · arXiv cs.CV English(EN) · 1mo

Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning

Researchers have introduced Glance-or-Gaze (GoG), a framework designed to improve how Large Multimodal Models (LMMs) handle knowledge-intensive visual queries. Unlike previous methods that retrieve information indiscriminately, GoG employs a Selective Gaze mechanism to adaptively focus on relevant image regions or global context. The framework is trained in two stages, combining supervised fine-tuning with complexity-adaptive reinforcement learning to enhance iterative reasoning and performance on complex visual tasks.

IMPACT Introduces a novel adaptive search mechanism for LMMs, potentially improving efficiency and accuracy in complex visual query tasks.
RESEARCH · arXiv cs.CV English(EN) · 1mo

UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents

Researchers have introduced UNIKIE-BENCH, a benchmark designed to systematically evaluate how well Large Multimodal Models (LMMs) extract key information from visual documents. The benchmark features two tracks: one for constrained-category key information extraction (KIE) with predefined schemas and another for open-category KIE. Experiments with 15 state-of-the-art LMMs showed significant performance drops on varied schemas, long-tail information, and complex layouts, indicating ongoing challenges in accuracy and reasoning for LMMs in this domain.

IMPACT Provides a standardized evaluation framework for LMMs in document information extraction, highlighting current limitations.