PulseAugur

New ProMSA agent enhances knowledge-based visual question answering

Researchers have introduced ProMSA, an agent for Knowledge-Based Visual Question Answering (KB-VQA). Unlike previous methods that rely on fixed retrieval pipelines, ProMSA decides at each step whether to run an image search, run a text search, or stop, subject to tool-call budgets and deduplication that prevent redundant searches. The agent is trained with rejection-sampling Supervised Fine-Tuning (SFT) to learn the tool-use format, combined with a sequence-level Reinforcement Learning (RL) objective called TN-GSPO. In experiments on the E-VQA and InfoSeek datasets, ProMSA achieves higher retrieval and end-to-end accuracy than existing retrieval-augmented generation (RAG) and agent baselines.
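
To make the control flow concrete, here is a minimal Python sketch of a progressive search loop with a tool-call budget and deduplication. It is an illustration under stated assumptions, not the authors' implementation: the names `run_search_agent`, `SearchState`, and `normalize`, the budget values, and the choice to deduplicate on a normalized query string are all hypothetical, and the stub tools stand in for real image and text retrievers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# (tool name, query). The tool name "stop" ends the episode.
Action = Tuple[str, str]


@dataclass
class SearchState:
    evidence: List[str] = field(default_factory=list)
    seen: Set[Action] = field(default_factory=set)
    tool_calls: int = 0


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries match."""
    return " ".join(query.lower().split())


def run_search_agent(
    policy: Callable[[SearchState], Action],
    tools: Dict[str, Callable[[str], List[str]]],
    max_tool_calls: int = 3,  # assumed budget, not a value from the paper
    max_steps: int = 8,       # hard cap so the loop always terminates
) -> SearchState:
    state = SearchState()
    for _ in range(max_steps):
        tool, query = policy(state)
        if tool == "stop" or state.tool_calls >= max_tool_calls:
            break
        key = (tool, normalize(query))
        if key in state.seen:
            continue  # duplicate request: skip the call; the step is still spent
        state.seen.add(key)
        state.tool_calls += 1
        state.evidence.extend(tools[tool](query))
    return state


# Toy usage: a scripted policy and stub tools (all hypothetical).
script = iter([
    ("image_search", "query image"),
    ("image_search", "Query  Image"),  # duplicate after normalization
    ("text_search", "height of the tower"),
    ("stop", ""),
])


def scripted_policy(state: SearchState) -> Action:
    return next(script)


stub_tools = {
    "image_search": lambda q: ["entity page (stub)"],
    "text_search": lambda q: ["passage (stub)"],
}

final = run_search_agent(scripted_policy, stub_tools)
print(final.tool_calls, final.evidence)
```

In a trained agent the scripted policy would be replaced by the model's own decision at each step; the loop only enforces the budget and skips repeated requests.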

IMPACT This new agent could improve the accuracy and efficiency of AI systems that need to answer questions based on both visual and textual information.

RANK_REASON The cluster contains a research paper detailing a new model/agent for a specific AI task.

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.AI TIER_1 English(EN) · ZhengXian Wu, Hangrui Xu, Kai Shi, Zhuohong Chen, Yunyao Yu, Chuanrui Zhang, Zirui Liao, Jun Yang, Zhenyu Yang, Haonan Lu, Haoqian Wang ·

    ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering

    arXiv:2606.27974v1. Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static t…

  2. arXiv cs.AI TIER_1 English(EN) · Haoqian Wang ·

    ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering

    Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, which is not adaptive during reasoni…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering

    A progressive multimodal search agent for knowledge-based visual question answering that adaptively selects search strategies and optimizes through sequence-level reinforcement learning.