New ProMSA agent enhances knowledge-based visual question answering

By PulseAugur Editorial · [1 source] · 2026-06-26 11:23

Researchers have introduced ProMSA, an agent for Knowledge-Based Visual Question Answering (KB-VQA), a task that requires combining image understanding with external knowledge. Unlike previous methods that rely on fixed retrieval pipelines, ProMSA decides step by step whether to run an image search, run a text search, or stop. Tool-call budgets cap the number of searches, and deduplication prevents redundant ones. The agent is trained with a combination of rejection-sampling Supervised Fine-Tuning (SFT), which teaches the tool-use format, and a sequence-level Reinforcement Learning (RL) objective called TN-GSPO. In experiments on the E-VQA and InfoSeek datasets, ProMSA achieves higher retrieval and end-to-end accuracy than existing retrieval-augmented generation (RAG) and agent baselines.
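To make the search loop described above concrete, here is a minimal Python sketch of an agent that picks one of three actions per step under per-tool budgets and skips repeated queries. It is an illustration under stated assumptions, not the paper's implementation: the names `run_agent`, `SearchState`, `toy_policy` and `toy_tools` are hypothetical, the hand-written policy stands in for ProMSA's trained model, and the search results are made-up toy data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical action names; the summary does not give ProMSA's real tool interface.
IMAGE_SEARCH, TEXT_SEARCH, STOP = "image_search", "text_search", "stop"


@dataclass
class SearchState:
    evidence: List[str] = field(default_factory=list)        # retrieved snippets so far
    seen: Set[Tuple[str, str]] = field(default_factory=set)  # (tool, query) pairs already issued
    calls: Dict[str, int] = field(
        default_factory=lambda: {IMAGE_SEARCH: 0, TEXT_SEARCH: 0}
    )


Policy = Callable[[str, SearchState], Tuple[str, str]]  # returns (action, query)
Tool = Callable[[str], List[str]]                       # returns retrieved snippets


def run_agent(question: str, policy: Policy, tools: Dict[str, Tool],
              budgets: Dict[str, int], max_steps: int = 8) -> SearchState:
    """Pick one action per step until the policy stops or max_steps is reached."""
    state = SearchState()
    for _ in range(max_steps):
        action, query = policy(question, state)
        if action == STOP:
            break
        key = (action, query.strip().lower())
        # Skip a call that repeats an earlier query or exceeds that tool's budget.
        if key in state.seen or state.calls[action] >= budgets[action]:
            continue
        state.seen.add(key)
        state.calls[action] += 1
        state.evidence.extend(tools[action](query))
    return state


# Toy stand-ins (assumptions): a rule-based policy and canned search results.
def toy_policy(question: str, state: SearchState) -> Tuple[str, str]:
    if not state.evidence:
        return IMAGE_SEARCH, "landmark in photo"
    if len(state.evidence) < 3:
        return TEXT_SEARCH, "who built " + state.evidence[0]
    return STOP, ""


toy_tools: Dict[str, Tool] = {
    IMAGE_SEARCH: lambda q: ["Eiffel Tower"],
    TEXT_SEARCH: lambda q: ["Built by Gustave Eiffel's company"],
}

final = run_agent("Who built this?", toy_policy, toy_tools,
                  budgets={IMAGE_SEARCH: 1, TEXT_SEARCH: 2})
print(final.calls, final.evidence)
```

In this toy run the policy keeps proposing the same text query, so the deduplication check lets it through only once and the loop ends at `max_steps`. In ProMSA the choice of action comes from the trained model rather than a fixed rule.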

IMPACT This new agent could improve the accuracy and efficiency of AI systems that need to answer questions based on both visual and textual information.

RANK_REASON The cluster contains a research paper detailing a new model/agent for a specific AI task.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Haoqian Wang · 2026-06-26 11:23

ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering

Knowledge-based Visual Question Answering (KB-VQA) requires models to combine image understanding with external knowledge. Most prior methods use a fixed retrieve-then-generate pipeline with a pre-selected retriever and a static top-k setting, which is not adaptive during reasoni…
