New benchmarks and methods advance multimodal LLM capabilities

By PulseAugur Editorial · [7 sources] · 2026-05-25 13:16

Researchers are developing new methods to help multimodal large language models (MLLMs) handle sequential audio-video data and large-scale visual recognition. One approach, DLLM-VSR, applies a diffusion large language model to visual speech recognition, achieving state-of-the-art results by iteratively denoising and decoding transcriptions rather than decoding strictly left to right. Another paper introduces SONIC-O1, a benchmark for evaluating MLLMs on real-world audio-video understanding, which highlights performance disparities across demographic groups. Other work targets efficiency: heterogeneous parallelism for training, a divide-and-conquer inference strategy that counters the performance degradation MLLMs show as the label space expands, and a parallel decoding method for diffusion-based MLLMs that treats each decoding step as a position-selection problem.

IMPACT Advances in multimodal LLMs promise improved performance in audio-video understanding, speech recognition, and large-scale visual tasks.

RANK_REASON Multiple research papers introducing new models, benchmarks, and training/inference techniques for multimodal large language models.

AI-generated summary · Google Gemini · from 7 sources.

COVERAGE [7]

arXiv cs.AI TIER_1 English(EN) · Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro · 2026-05-28 04:00

Diffusion Large Language Models for Visual Speech Recognition

arXiv:2605.28456v1 · Announce Type: new · Abstract: Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, t…

arXiv cs.AI TIER_1 English(EN) · Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza · 2026-05-28 04:00

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

arXiv:2601.21666v2 · Announce Type: replace · Abstract: Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. …

arXiv cs.LG TIER_1 English(EN) · Yashaswi Karnati, Kamran Jafari, Akash Mehra, Li Ding, Pranav Prashant Thombre, Ali Roshan Ghias, Shifang Xu, Parth Mannan, Yu Yao, Hao Wu, Eric Harper, Ashwath Aithal, Nima Tajbakhsh · 2026-05-28 04:00

Heterogeneous Parallelism for Multimodal Large Language Model Training

arXiv:2605.27678v1 · Announce Type: new · Abstract: Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layo…

arXiv cs.AI TIER_1 English(EN) · Zhipeng Ye, Jiaqi Huang, Feng Jiang, Qiufeng Wang, Yikang Duan, Dawei Wang, Xihang Zhou, Qian Qiao · 2026-05-26 04:00

Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models

arXiv:2605.24799v1 · Announce Type: cross · Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision language tasks. However, when applied to large scale image classification, their performance degrades significantly as th…
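The abstract above is truncated, so the paper's actual procedure is not visible here. As a rough illustration of what a divide-and-conquer pass over a large label space can look like in general, the sketch below splits the candidate labels into small groups, asks a chooser to pick one winner per group, and repeats on the winners until one label remains. The names `divide_and_conquer_classify`, `choose`, and `group_size` are hypothetical; `choose` stands in for an MLLM query over a short label list and is not taken from the paper.

```python
from typing import Callable, List, Sequence


def divide_and_conquer_classify(
    labels: Sequence[str],
    choose: Callable[[Sequence[str]], str],
    group_size: int = 4,
) -> str:
    """Narrow a large label set by picking one winner per small group, round by round.

    `choose` is a hypothetical stand-in for a model call that sees only a
    short list of candidate labels and returns the best one among them.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    candidates: List[str] = list(labels)
    if not candidates:
        raise ValueError("labels must be non-empty")
    while len(candidates) > 1:
        # Divide: the chooser never sees more than group_size labels at once.
        groups = [
            candidates[i:i + group_size]
            for i in range(0, len(candidates), group_size)
        ]
        # Conquer: keep one winner per group and recurse on the winners.
        candidates = [choose(group) for group in groups]
    return candidates[0]


# Toy chooser (an assumption for illustration): fixed scores instead of a model.
TOY_SCORES = {"cat": 0.1, "dog": 0.7, "fox": 0.9, "owl": 0.3, "bee": 0.2}


def toy_choose(group: Sequence[str]) -> str:
    return max(group, key=TOY_SCORES.__getitem__)


best = divide_and_conquer_classify(list(TOY_SCORES), toy_choose, group_size=2)
```

The point of the design is that each individual query stays small regardless of how many labels exist overall; the cost is several rounds of queries, and a wrong pick in an early round cannot be recovered later in this simple form.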

arXiv cs.LG TIER_1 English(EN) · Yulin Yuan, Hongshuo Zhao, Xiangming Meng · 2026-05-26 04:00

Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

arXiv:2605.25820v1 · Announce Type: new · Abstract: Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not o…

arXiv cs.LG TIER_1 English(EN) · Xiangming Meng · 2026-05-25 13:16

Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation,…
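To make the position-selection framing concrete, here is a minimal sketch of generic confidence-ranked parallel unmasking: every position starts masked, a predictor proposes a token and a confidence for each position, and only the most confident masked positions are committed at each step. This is not the paper's method; the visual-redundancy control named in the title is not modeled, and `parallel_unmask`, `predict`, `per_step`, and the toy predictor are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[str, float]  # (predicted token, confidence in [0, 1])


def parallel_unmask(
    length: int,
    predict: Callable[[List[Optional[str]]], List[Prediction]],
    per_step: int = 2,
) -> List[str]:
    """Fill a fully masked sequence by committing a few positions per step.

    `predict` is a hypothetical stand-in for the model: given the current
    sequence (None marks a masked position), it returns one (token,
    confidence) pair for every position.
    """
    if per_step < 1:
        raise ValueError("per_step must be at least 1")
    tokens: List[Optional[str]] = [None] * length
    while any(t is None for t in tokens):
        preds = predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is None]
        # Position selection: rank masked positions by confidence and
        # commit only the top per_step of them in this step.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return [t for t in tokens if t is not None]


# Toy predictor (an assumption for illustration): fixed tokens and confidences.
TARGET = ["the", "cat", "sat", "down", "now"]
CONFIDENCE = [0.9, 0.2, 0.8, 0.4, 0.6]


def toy_predict(tokens: List[Optional[str]]) -> List[Prediction]:
    return list(zip(TARGET, CONFIDENCE))


decoded = parallel_unmask(len(TARGET), toy_predict, per_step=2)
```

A real predictor would recompute its outputs from the partially filled sequence at every step, so the tokens committed early shape the predictions for the positions that remain masked.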

arXiv cs.CV TIER_1 English(EN) · Yong Man Ro · 2026-05-27 13:22

Diffusion Large Language Models for Visual Speech Recognition

Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge, the first Diffusion…
