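Researchers are developing new methods for multimodal large language models (MLLMs) to improve their understanding of sequential audio-video data and large-scale visual recognition. One method, DLLM-VSR, applies a diffusion model to visual speech recognition and achieves state-of-the-art results by iteratively denoising and decoding transcripts. Another paper introduces SONIC-O1, a benchmark for evaluating MLLMs on real-world audio-video understanding, which highlights performance disparities across demographic groups. New techniques for efficient MLLM training and inference are also being explored, including heterogeneous parallelism for training and a "divide-and-conquer" strategy for inference that addresses the performance degradation caused by label-space expansion.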
arXiv:2605.28456v1 Announce Type: new Abstract: Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge, the first Diffusion…
arXiv cs.AI
Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza
arXiv:2601.21666v2 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. …
arXiv cs.LG
Yashaswi Karnati, Kamran Jafari, Akash Mehra, Li Ding, Pranav Prashant Thombre, Ali Roshan Ghias, Shifang Xu, Parth Mannan, Yu Yao, Hao Wu, Eric Harper, Ashwath Aithal, Nima Tajbakhsh
arXiv:2605.27678v1 Announce Type: new Abstract: Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layo…
arXiv:2605.24799v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision-language tasks. However, when applied to large-scale image classification, their performance degrades significantly as th…
arXiv:2605.25820v1 Announce Type: new Abstract: Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation,…
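To make the position-selection idea concrete, here is a minimal Python sketch of iterative parallel unmasking. It is not the method from the paper above: it covers only the "reliable in isolation" criterion by committing the most confident masked positions at each step, and it says nothing about whatever further criterion the truncated abstract goes on to describe. The names `toy_predictor`, `iterative_unmask`, and `per_step` are hypothetical, and the predictor is a stand-in for a real model.

```python
import random
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position that has not been decoded yet

# A predictor maps the partially decoded sequence to one
# (token, confidence) pair per position.
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def toy_predictor(target: List[str], seed: int = 0) -> Predictor:
    """Hypothetical stand-in for a model (assumption, not a real dMLLM)."""
    rng = random.Random(seed)

    def predict(seq: List[Optional[str]]) -> List[Tuple[str, float]]:
        # Confidence rises as more context is decoded, plus a little noise.
        known = sum(tok is not MASK for tok in seq) / len(seq)
        return [
            (target[i], min(1.0, 0.3 + 0.7 * known + 0.1 * rng.random()))
            for i in range(len(seq))
        ]

    return predict


def iterative_unmask(
    length: int, predict: Predictor, per_step: int = 2
) -> Tuple[List[Optional[str]], int]:
    """Decode by committing the `per_step` most confident masked positions."""
    seq: List[Optional[str]] = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        preds = predict(seq)  # predictions for all positions in parallel
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Position selection: rank masked positions by confidence alone.
        chosen = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in chosen:
            seq[i] = preds[i][0]
        steps += 1
    return seq, steps


target = "the cat sat on the mat".split()
decoded, steps = iterative_unmask(len(target), toy_predictor(target), per_step=2)
print(decoded, steps)
```

Raising `per_step` trades fewer decoding steps for more positions committed under less context, which is the tension that makes position selection a problem in its own right.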