Researchers are developing new methods to help multimodal large language models (MLLMs) understand sequential audio-video data and handle large-scale visual recognition. One approach, DLLM-VSR, uses diffusion models for visual speech recognition, achieving state-of-the-art results by iteratively denoising and decoding transcriptions. A second paper introduces SONIC-O1, a benchmark for evaluating MLLMs on real-world audio-video understanding, which highlights performance disparities across demographic groups. Other work explores efficient training and inference for MLLMs, including heterogeneous parallelism for training and a divide-and-conquer inference strategy that counters the performance degradation seen as label spaces expand.
AI IMPACT Advances in multimodal LLMs promise improved performance in audio-video understanding, speech recognition, and large-scale visual tasks.
RANK_REASON Multiple research papers introducing new models, benchmarks, and training/inference techniques for multimodal large language models.
- Diffusion Large Language Models
- Divide-and-Conquer Inference
- DLLM-VSR
- Megatron-LM
- Multimodal Large Language Models
- SONIC-O1
- Visual-Redundancy-Controlled Decoding
AI-generated summary · Google Gemini · from 7 sources.