Researchers have introduced ProMSA, an agent designed for Knowledge-Based Visual Question Answering (KB-VQA). Unlike previous methods that rely on fixed retrieval pipelines, ProMSA decides at each step whether to run an image search, run a text search, or stop, subject to a tool-call budget and deduplication that prevents redundant searches. The agent is trained using a combination of rejection-sampling Supervised Fine-Tuning (SFT) for tool-use formats and a sequence-level Reinforcement Learning (RL) objective called TN-GSPO. Experiments on the E-VQA and InfoSeek datasets show that ProMSA achieves higher retrieval and end-to-end accuracy than existing retrieval-augmented generation (RAG) and agent baselines.
AI IMPACT This agent could improve the accuracy and efficiency of AI systems that answer questions drawing on both visual and textual information.
RANK_REASON The cluster contains a research paper detailing a new model/agent for a specific AI task.
AI-generated summary · Google Gemini · from 3 sources.
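
The control loop described in the summary (choose image search, text search, or stop, under a tool-call budget with deduplication) can be sketched roughly as follows. This is a minimal illustration, not ProMSA's actual implementation: `run_agent`, `policy`, `tools`, and the tool names are hypothetical, and the sketch assumes that a duplicate request is skipped but still uses up one turn of the budget so the loop always terminates.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical action format: (tool_name, query). The tool name "stop" ends the episode.
Action = Tuple[str, str]


def run_agent(
    policy: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    budget: int = 3,
) -> Tuple[List[str], List[Action]]:
    """Run a budgeted search loop and return (evidence, executed_calls)."""
    seen = set()                    # normalized (tool, query) pairs already requested
    evidence: List[str] = []        # results gathered so far, passed back to the policy
    executed: List[Action] = []     # calls that were actually run

    for _ in range(budget):         # assumption: every turn counts against the budget
        tool, query = policy(evidence)
        if tool == "stop":
            break
        key = (tool, query.strip().lower())
        if key in seen:
            continue                # deduplication: skip a repeated search
        seen.add(key)
        evidence.append(tools[tool](query))
        executed.append(key)

    return evidence, executed
```

Here `policy` stands in for the trained model: it receives the evidence collected so far and returns the next action. In a real system the normalization used for deduplication and the way the budget is charged would follow the paper's definitions, which the summary does not specify.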