PulseAugur

VAPO model tackles visual interference in speech recognition with novel 'Look-then-Listen' approach

Researchers have developed a new method called Visually-Anchored Policy Optimization (VAPO) to improve speech recognition in the presence of visual slide content. Omni-modal large language models (OLLMs) often suffer from "visual interference," where they hallucinate spoken words from visible text. VAPO addresses this by decoupling the model's process into distinct visual prior extraction and transcription generation steps, mimicking human "Look-then-Listen" behavior. This approach, along with a new benchmark called SlideASR-Bench, significantly reduces errors in entity recognition for specialized domains.
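The decoupling described above can be sketched in miniature. The sketch below is a toy illustration only: `extract_visual_prior` and `transcribe` are hypothetical simplifications standing in for the paper's actual OLLM stages, using a crude capitalization heuristic and a casing-correction map rather than learned models. The point it illustrates is the "Look-then-Listen" separation, where the slide is used only to anchor entity spellings, not to inject unspoken text into the transcript.

```python
def extract_visual_prior(slide_text: str) -> set[str]:
    """Stage 1 (Look): pull candidate entity terms from the slide.
    Toy heuristic: treat capitalized tokens as domain entities."""
    return {tok for tok in slide_text.split() if tok[:1].isupper()}


def transcribe(audio_hypothesis: list[str], prior: set[str]) -> str:
    """Stage 2 (Listen): decode the audio hypothesis, consulting the
    visual prior only to fix entity surface forms -- words not spoken
    in the audio are never added from the slide."""
    by_lower = {entity.lower(): entity for entity in prior}
    return " ".join(by_lower.get(word.lower(), word) for word in audio_hypothesis)


slide = "Fine-tuning BERT with LoRA adapters"
# Raw ASR hypothesis with mis-cased entity names:
hypothesis = ["we", "apply", "lora", "to", "bert"]
print(transcribe(hypothesis, extract_visual_prior(slide)))
# → we apply LoRA to BERT
```

Note that the prior corrects "lora" and "bert" to their slide spellings, but slide-only words like "adapters" never enter the transcript, which is exactly the visual-interference failure mode the decoupling is meant to prevent.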

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Introduces a novel approach to mitigate visual interference in multimodal speech recognition, potentially improving accuracy in presentation-based ASR systems.

RANK_REASON This is a research paper introducing a new method and benchmark for speech recognition.

Read on arXiv cs.CV →

COVERAGE [1]

  1. arXiv cs.CV TIER_1 · Rui Hu, Delai Qiu, Yining Wang, Shengping Liu, Jitao Sang ·

    VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models

    arXiv:2510.08618v2 Announce Type: replace-cross Abstract: Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we found a fundamental issue faced by OLLMs: \tex…