Researchers have developed a new method called Visually-Anchored Policy Optimization (VAPO) to improve speech recognition in the presence of visual slide content. Omni-modal large language models (OLLMs) often suffer from "visual interference," hallucinating spoken words from the text visible on slides. VAPO addresses this by decoupling the model's process into distinct visual prior extraction and transcription generation steps, mimicking a human "Look-then-Listen" strategy. Together with a new benchmark, SlideASR-Bench, the approach significantly reduces entity recognition errors in specialized domains.
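The decoupled "Look-then-Listen" pipeline can be illustrated with a minimal sketch. All function names and the matching logic below are hypothetical stand-ins, not the VAPO paper's actual method: step 1 extracts candidate entity terms from the slide (the visual prior), and step 2 decodes the audio, using the prior only to correct near-miss entity tokens rather than to inject unspoken words.

```python
# Hypothetical sketch of a "Look-then-Listen" decoupling; names and
# heuristics are illustrative assumptions, not from the VAPO paper.

def extract_visual_prior(slide_text: str) -> set[str]:
    """Step 1 (Look): collect candidate entity terms from the slide.
    Here, any capitalized token is treated as a candidate entity."""
    return {tok.strip(".,").lower() for tok in slide_text.split()
            if tok[:1].isupper()}

def _close(a: str, b: str) -> bool:
    """Crude similarity stand-in: same first letter, length within 1."""
    return a[:1] == b[:1] and abs(len(a) - len(b)) <= 1

def transcribe(audio_tokens: list[str], prior: set[str]) -> list[str]:
    """Step 2 (Listen): decode the audio stream; the visual prior is
    consulted only to snap misheard tokens to known entities, which is
    what keeps slide text from being hallucinated into the transcript."""
    out = []
    for tok in audio_tokens:
        match = next((e for e in prior if _close(tok.lower(), e)), None)
        out.append(match if match else tok)
    return out
```

In this toy version, a misrecognized token such as "resnit" would be corrected to the slide entity "ResNet", while words with no acoustic support are never added, reflecting the summary's point that the prior anchors rather than overrides the audio.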
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Introduces a novel approach to mitigate visual interference in multimodal speech recognition, potentially improving accuracy in presentation-based ASR systems.
RANK_REASON This is a research paper introducing a new method and benchmark for speech recognition.