Brief

[18/18] 222 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.
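As a rough illustration of how a 0–100 score could combine these four signals, here is a minimal sketch. The feed does not publish its formula, so the weights, the 48-hour half-life, the `Signals` fields, and the `score` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    authority: float         # 0..1, assumed source-reputation signal
    cluster_strength: float  # 0..1, assumed share of sources covering the story
    headline_signal: float   # 0..1, assumed headline-keyword signal
    age_hours: float         # hours since publication


# Hypothetical weights that sum to 1; not the feed's actual values.
WEIGHTS = {"authority": 0.4, "cluster_strength": 0.35, "headline_signal": 0.25}
HALF_LIFE_HOURS = 48.0  # assumed: the score halves every 48 hours


def score(s: Signals) -> float:
    """Weighted sum of the three signals, scaled by exponential time decay."""
    base = (WEIGHTS["authority"] * s.authority
            + WEIGHTS["cluster_strength"] * s.cluster_strength
            + WEIGHTS["headline_signal"] * s.headline_signal)
    decay = 0.5 ** (max(s.age_hours, 0.0) / HALF_LIFE_HOURS)
    # Clamp so out-of-range inputs cannot push the score outside 0..100.
    return round(min(max(100.0 * base * decay, 0.0), 100.0), 1)
```

Under these assumptions, a story with maximal signals scores 100 when fresh and 50 after one half-life, so older items sink in the ranking unless their other signals are strong.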

TOOL · arXiv cs.CV English(EN) · 6d

TextSculptor: Training and Benchmarking Scene Text Editing

Researchers have introduced TextSculptor, a framework for scene text editing in images. It includes an automated data construction pipeline that generates 3.2 million samples for text-to-image synthesis and text editing, plus a benchmark suite covering four core editing functions: addition, replacement, removal, and hybrid editing. The aim is to improve the performance of open-source models in this domain.

IMPACT Enhances open-source capabilities for precise text manipulation in images, potentially improving applications like content creation and accessibility tools.
TOOL · arXiv cs.CV English(EN) · 4d

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

Researchers have identified a temporal grounding gap in multimodal large language models (MLLMs): the models locate event timing during prefill but lose this signal during answer generation. They found specific attention heads, termed Temporal Grounding Heads (TG-Heads), that attend to the correct time intervals in videos during prefill. Their inference-time framework uses these TG-Heads to extract the relevant interval, then re-invokes the model with restricted visual context, improving performance on video temporal grounding benchmarks without retraining.

IMPACT Improves multimodal LLM accuracy on video temporal grounding tasks by addressing a key perception-generation gap without retraining.
- MLLMs
- Qwen3-VL-8B
- TG-Heads
- TimeLens-8B
- MiMo-VL-7B
TOOL · arXiv cs.CV Danish(DA) · 4d

Universal Skeleton Understanding via Differentiable Rendering and MLLMs

Researchers have developed SkeletonLLM, an approach that lets multimodal large language models (MLLMs) understand structured, non-visual data such as human skeletons. The system uses DrAction, a differentiable renderer that converts skeletal motion into image sequences, so MLLMs can process the data as visual input. This supports open-vocabulary action recognition, motion captioning, and question answering across diverse skeleton formats, suggesting a path for MLLMs to engage with non-native data types.

IMPACT Enables MLLMs to process structured, non-visual data like human skeletons, expanding their application scope.
- MLLMs
- Ziyi Wang
- SkeletonLLM
- DrAction
TOOL · arXiv cs.CV English(EN) · 4d

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

Researchers have developed ST-SimDiff, a framework that makes multimodal large language models (MLLMs) more efficient at processing long videos. It reduces the computational burden by accounting for both static redundancy and dynamic changes in video content. ST-SimDiff models token associations with a spatio-temporal graph and applies a dual-selection strategy: representative tokens capture static information, and key turning points capture dynamic content. Experiments indicate that the approach significantly outperforms existing methods while reducing computational cost.

IMPACT Enhances efficiency for MLLMs processing video, potentially enabling broader applications with longer video inputs.
TOOL · arXiv cs.LG English(EN) · 4d

Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

Researchers have developed a pipeline that improves how multimodal large language models (MLLMs) analyze safety-critical driving events. It fuses downsampled video frames with telematics data and insights from specialized computer vision models to create high-quality training data. Fine-tuning the open-source QwenVL-2.5 model on this data significantly improved its ability to identify and explain safety-critical events within a limited computational budget.

IMPACT Enhances AI's ability to analyze complex, safety-critical visual data, potentially improving autonomous driving systems.
TOOL · arXiv cs.LG English(EN) · 4d

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

Researchers have introduced MapTab, a new benchmark designed to evaluate the multi-criteria reasoning abilities of multimodal large language models (MLLMs). This benchmark utilizes route planning tasks that combine visual map data with structured tabular information on criteria such as time and price. MapTab includes two scenarios, Metromap and Travelmap, featuring extensive datasets of maps, queries, and questions to challenge MLLMs. Initial evaluations indicate that current MLLMs struggle with these complex multimodal reasoning tasks, sometimes underperforming unimodal approaches when visual perception is limited.

IMPACT Establishes a new evaluation standard for multimodal LLMs, pushing for more robust reasoning capabilities beyond current benchmarks.
- Ziqiao Shang
- AGI
- MLLMs
- MapTab
TOOL · arXiv cs.CV English(EN) · 4d

Dissecting Embodied Abilities in Multimodal Language Models through Skill-level Evaluation and Diagnosis

Researchers have introduced BEAR, a new benchmark designed to evaluate and diagnose the skill-level capabilities of embodied multimodal large language models (MLLMs). This benchmark decomposes embodied tasks into 14 distinct atomic skills, providing more granular insights into model failures than previous task-level evaluations. Evaluations on BEAR revealed that perceptual limitations and unstable spatiotemporal modeling are significant bottlenecks for current MLLMs. To address these issues, the team developed BEAR-Agent, a conversational agent that enhances MLLMs with visual and spatial reasoning tools, demonstrating substantial performance improvements on the benchmark and in robotic experiments.

IMPACT Identifies key weaknesses in embodied AI, guiding future research towards improved perception and spatiotemporal reasoning for robotic agents.
- GPT-5
- MLLMs
- BEAR
- BEAR-Agent
RESEARCH · arXiv cs.CL English(EN) · 5d · [2 sources]

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

Researchers have developed a novel method for detecting AI-generated modern Chinese poetry by integrating image semantics with text analysis. This approach uses images related to the poem's content to provide complementary information, enhancing the judgment process. Experiments show that this image-semantic guided method significantly outperforms traditional text-based detection, with a Gemini-based detector achieving a state-of-the-art Macro-F1 score of 85.65%.

IMPACT This method could improve AI-generated text detection, particularly for creative content like poetry.
- MLLMs
- RoBERTa
- Gemini
TOOL · arXiv cs.AI English(EN) · 5d

AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

Researchers have introduced AtelierEval, a benchmark that evaluates how well humans and multimodal large language models (MLLMs) write effective text-to-image prompts. Its 360 expert-crafted tasks quantify how well a prompt translates user intent into detailed instructions for a text-to-image system. AtelierEval also features AtelierJudge, an agentic evaluator that correlates strongly with human expert assessments. The experiments suggest that mimicry-based prompting may be more effective than planning-based approaches for future prompters.

IMPACT Introduces a new evaluation framework for text-to-image prompting, enabling better assessment of both human and AI prompter capabilities.
RESEARCH · arXiv cs.AI English(EN) · 6d · [3 sources]

ACL-Verbatim: hallucination-free question answering for research

Two new research papers address the critical issue of AI hallucinations in different domains. One paper introduces ACL-Verbatim, an extractive question-answering system designed to provide hallucination-free answers from research papers by mapping queries to verbatim text spans. The other paper, VIHD, proposes a visual intervention-based method for detecting hallucinations in medical visual question-answering models by analyzing cross-modal dependencies between text and visual tokens.

IMPACT These papers offer new techniques to improve the reliability of AI systems in research and medical applications, reducing risks associated with inaccurate information.
- ModernBERT
- LLMs
- arXiv
- MLLMs
- ACL-Verbatim
RESEARCH · arXiv cs.AI English(EN) · 4d · [3 sources]

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

Researchers have developed new methods to improve visual grounding in multimodal large language models (MLLMs). One approach, PGT, uses procedurally generated tasks with geometric primitives to provide denser supervision, leading to significant gains on various benchmarks. Another development, AgroVG, introduces a large-scale benchmark specifically for agricultural visual grounding, highlighting current model limitations in complex scenarios.

IMPACT Advances in visual grounding are crucial for enabling more sophisticated AI applications in areas like agriculture and general perception tasks.
RESEARCH · Hugging Face Daily Papers English(EN) · 6d · [3 sources]

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Researchers have introduced LatentOmni, a novel framework designed to enhance omnimodal understanding by unifying audio-visual reasoning within a latent space. This approach aims to overcome limitations in current multimodal large language models (MLLMs) that struggle with fine-grained temporal grounding. LatentOmni interleaves textual reasoning with continuous audio-visual latent states, preserving sensory information and improving temporal consistency through techniques like Omni-Sync Position Embedding. The framework is supported by a new dataset, LatentOmni-Instruct-35K, and has demonstrated superior performance on audio-visual reasoning benchmarks compared to existing open-source models.

IMPACT Enhances omnimodal understanding by improving audio-visual reasoning in LLMs, potentially leading to more robust AI systems.
RESEARCH · Hugging Face Daily Papers English(EN) · 6d · [2 sources]

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Researchers have introduced a new benchmark and dataset called MM-OCEAN to evaluate how well multimodal large language models (MLLMs) can reason about personality. The study found that a significant portion of MLLMs, over 51%, provide correct personality assessments without grounding their judgments in observable behavioral evidence. This "Prejudice Gap" highlights a disconnect between accurate predictions and genuine understanding, suggesting a need for more robust evaluation methods for social cognition in AI.

IMPACT Highlights a critical flaw in current MLLM evaluations, potentially impacting their deployment in human-facing roles and guiding future safety research.
TOOL · arXiv cs.IR (Information Retrieval) English(EN) · 4d

From Head to Tail: Asymmetric Knowledge Transfer in Long-tail Recommendation with Generative Semantic IDs

Researchers have developed a new framework called AKT-Rec to address challenges in long-tail recommendation systems, particularly those in e-commerce platforms with significant data imbalance. This framework utilizes multimodal large language models (MLLMs) to generate semantic IDs that align content features with collaborative filtering signals. AKT-Rec incorporates an asymmetric contrastive objective and an activity-aware gating mechanism to facilitate knowledge transfer from head to tail items, improving representation learning. Experiments on a large-scale industrial dataset and subsequent online A/B testing on Alibaba's Tmall platform demonstrated substantial improvements in key metrics such as AUC, GAUC, CTR, and GMV.

IMPACT Enhances e-commerce recommendation systems by improving CTR and GMV through better handling of data imbalance.
RESEARCH · Hugging Face Daily Papers Italian(IT) · 1w · [6 sources]

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

Researchers have developed several new techniques to improve the efficiency and quality of video diffusion models. LocalDPO optimizes alignment at the level of localized spatio-temporal regions for better video fidelity and coherence. ARL2 replaces quadratic self-attention with a fixed-size recurrent state, giving linear time scaling and constant memory usage, which speeds up generation and reduces memory requirements. ORBIS is a software-hardware co-designed accelerator that uses output activation for more accurate inter-token similarity, yielding higher token reduction ratios along with significant speedup and energy reduction. Finally, Bernini unifies multimodal large language models (MLLMs) with diffusion models, using MLLMs for semantic planning and diffusion models for pixel rendering, and achieves state-of-the-art performance in video generation and editing.

IMPACT These advancements in video diffusion models promise more efficient and higher-quality video generation, potentially impacting creative industries and AI-driven content creation.
TOOL · Hugging Face Daily Papers English(EN) · 1w

EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

Researchers have introduced EgoCoT-Bench, a benchmark that evaluates how Multimodal Large Language Models (MLLMs) reason over egocentric video. It focuses on the models' ability to understand hand-object interactions, track object states, and reason about manipulative processes from a first-person perspective. To address limitations in existing benchmarks, EgoCoT-Bench provides explicit, step-by-step rationale annotations grounded in spatio-temporal evidence. Evaluations reveal that many current MLLMs generate correct answers with inconsistent supporting evidence.

IMPACT Provides a new evaluation tool to push MLLMs towards more verifiable and grounded reasoning in video understanding tasks.
RESEARCH · arXiv cs.MA (Multiagent) English(EN) · 1w · [8 sources]

PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

Researchers have introduced two new benchmarks, VGenST-Bench and CaST-Bench, designed to more rigorously evaluate the spatio-temporal reasoning capabilities of Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs). VGenST-Bench utilizes active video synthesis to create controlled scenarios across various spatial and temporal dimensions, enabling fine-grained diagnosis of MLLM understanding. CaST-Bench focuses on causal chain-grounded spatio-temporal reasoning, requiring models to identify and localize evidence for cause-and-effect relationships in videos, highlighting current VLM limitations in this area.

IMPACT These benchmarks aim to improve the evaluation of AI models' understanding of real-world scenarios, pushing for more robust spatio-temporal and causal reasoning.
RESEARCH · arXiv cs.AI English(EN) · 2w · [2 sources]

CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

Two new research papers highlight challenges in developing AI for non-English languages and cultures. One paper reflects on two decades of building Arabic NLP resources, concluding that social and institutional factors are harder to overcome than linguistic ones. The other paper introduces a benchmark for evaluating how well Multimodal Large Language Models (MLLMs) can adapt to different cultures without negatively impacting their performance in other cultural contexts.

IMPACT Highlights the need for more culturally aware and linguistically diverse AI models, suggesting current approaches struggle with cross-cultural adaptation.