PulseAugur

New benchmarks and methods advance multimodal LLM capabilities

Researchers are developing new methods that help multimodal large language models (MLLMs) understand sequential audio-video data and handle large-scale visual recognition. One approach, DLLM-VSR, applies a diffusion large language model to visual speech recognition and reports state-of-the-art results by iteratively denoising and decoding transcriptions. Another paper introduces SONIC-O1, a benchmark for evaluating MLLMs on real-world audio-video understanding, which highlights performance disparities across demographic groups. Other work targets efficient training and inference: heterogeneous parallelism for training, and a divide-and-conquer inference strategy that counters the performance degradation seen as label spaces expand.
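As a rough illustration of the divide-and-conquer idea for large label spaces, the sketch below splits the candidate labels into small groups, picks one winner per group, and repeats on the winners until a single label remains. It is a generic tournament-style sketch, not the paper's algorithm: `divide_and_conquer_classify`, `choose`, `group_size`, and the toy scores are all hypothetical names and values, and `choose` stands in for one MLLM query over a short list of candidates.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for one model call: given a short candidate list,
# return the single label the model prefers for the image.
Chooser = Callable[[Sequence[str]], str]


def divide_and_conquer_classify(
    labels: Sequence[str], choose: Chooser, group_size: int = 4
) -> str:
    """Reduce a large label set to one label through rounds of small choices."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    candidates: List[str] = list(labels)
    if not candidates:
        raise ValueError("labels must not be empty")
    while len(candidates) > 1:
        # Divide: split the surviving labels into groups the model can handle.
        groups = [
            candidates[i : i + group_size]
            for i in range(0, len(candidates), group_size)
        ]
        # Conquer: keep one winner per group (a singleton group wins by default).
        candidates = [choose(g) if len(g) > 1 else g[0] for g in groups]
    return candidates[0]


# Toy usage: a fake scorer that prefers "class_7" (assumed values, not real data).
labels = [f"class_{i}" for i in range(10)]
scores = {name: (1.0 if name == "class_7" else 0.1) for name in labels}
seen_sizes: List[int] = []


def toy_choose(group: Sequence[str]) -> str:
    seen_sizes.append(len(group))  # record how many labels each call had to compare
    return max(group, key=lambda name: scores[name])


result = divide_and_conquer_classify(labels, toy_choose, group_size=3)
```

In this sketch no single call compares more than `group_size` labels, which is the property that matters when accuracy drops as the candidate list grows; the cost is several model calls per image instead of one.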

IMPACT Advances in multimodal LLMs promise improved performance in audio-video understanding, speech recognition, and large-scale visual tasks.

RANK_REASON Multiple research papers introducing new models, benchmarks, and training/inference techniques for multimodal large language models.


AI-generated summary · Google Gemini · from 7 sources.


COVERAGE [7]

  1. arXiv cs.AI TIER_1 English(EN) · Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro

    Diffusion Large Language Models for Visual Speech Recognition

    arXiv:2605.28456v1. Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, t…

  2. arXiv cs.AI TIER_1 English(EN) · Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza

    SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

    arXiv:2601.21666v2. Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. …

  3. arXiv cs.LG TIER_1 English(EN) · Yashaswi Karnati, Kamran Jafari, Akash Mehra, Li Ding, Pranav Prashant Thombre, Ali Roshan Ghias, Shifang Xu, Parth Mannan, Yu Yao, Hao Wu, Eric Harper, Ashwath Aithal, Nima Tajbakhsh

    Heterogeneous Parallelism for Multimodal Large Language Model Training

    arXiv:2605.27678v1. Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layo…

  4. arXiv cs.AI TIER_1 English(EN) · Zhipeng Ye, Jiaqi Huang, Feng Jiang, Qiufeng Wang, Yikang Duan, Dawei Wang, Xihang Zhou, Qian Qiao

    Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models

    arXiv:2605.24799v1. Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision language tasks. However, when applied to large scale image classification, their performance degrades significantly as th…

  5. arXiv cs.LG TIER_1 English(EN) · Yulin Yuan, Hongshuo Zhao, Xiangming Meng

    Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

    arXiv:2605.25820v1. Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not o…

  6. arXiv cs.LG TIER_1 English(EN) · Xiangming Meng

    Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

    Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation,…

  7. arXiv cs.CV TIER_1 English(EN) · Yong Man Ro

    Diffusion Large Language Models for Visual Speech Recognition

    Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge, the first Diffusion…
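The diffusion-style decoding that several of these abstracts describe (start from a fully masked sequence, predict all masked positions in parallel, then decide which predictions to commit) can be sketched generically as follows. This is a plain confidence-ranked unmasking loop under stated assumptions, not the selection rule of DLLM-VSR or of the visual-redundancy-controlled method, whose details are truncated above. `parallel_unmask`, `make_toy_predictor`, `per_step`, and the confidence formula are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

MASK = "<mask>"

# Hypothetical model interface: for every masked position, propose a token
# together with a confidence score.
Predictor = Callable[[List[str]], Dict[int, Tuple[str, float]]]


def parallel_unmask(
    length: int, predict: Predictor, per_step: int = 2
) -> Tuple[List[str], int]:
    """Decode by repeatedly committing the most confident masked positions."""
    if per_step < 1:
        raise ValueError("per_step must be at least 1")
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        proposals = predict(tokens)
        if not proposals:
            raise ValueError("predictor returned no proposals for masked positions")
        # Position selection: rank proposals by confidence, commit the top few,
        # and leave the rest masked so later steps can use more context.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:per_step]:
            tokens[pos] = tok
        steps += 1
    return tokens, steps


def make_toy_predictor(target: List[str]) -> Predictor:
    """Toy predictor: always proposes the right token; confidence rises with
    the number of already-decoded neighbours (an assumed heuristic)."""

    def predict(tokens: List[str]) -> Dict[int, Tuple[str, float]]:
        out: Dict[int, Tuple[str, float]] = {}
        for i, tok in enumerate(tokens):
            if tok == MASK:
                neighbours = [
                    tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)
                ]
                known = sum(t != MASK for t in neighbours)
                out[i] = (target[i], 0.5 + 0.25 * known)
        return out

    return predict


target = "the cat sat on the mat".split()
decoded, steps = parallel_unmask(len(target), make_toy_predictor(target), per_step=2)
```

Unlike left-to-right autoregressive decoding, the loop is free to fill in easy positions first and defer ambiguous ones; how many positions to commit per step, and which ones, is the trade-off between speed and reliability that the position-selection framing refers to.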