Two new arXiv surveys give overviews of visual reasoning tasks in computer vision. The first details Knowledge-based Vision Question Answering (KB-VQA) systems, categorizing them by knowledge representation, retrieval, and reasoning, and highlighting the impact of large language models (LLMs) on the field. The second provides a taxonomy of visual reasoning, dividing it into five types (relational, symbolic, temporal, causal, and commonsense) and examining methodologies that include LLMs and multimodal large language models (MLLMs). Both papers identify persistent challenges and outline future research directions for these capabilities.
IMPACT These surveys consolidate current research, identify key challenges, and propose future directions for visual reasoning and knowledge-based VQA systems.
RANK_REASON Two academic papers published on arXiv provide comprehensive surveys of specific AI research areas.
- arXiv
- Hugging Face
- Jiaqi Deng
- Knowledge-based Vision Question Answering
- large language models
- multimodal large language models
- Vision Question Answering
- Zhenyu Yu
AI-generated summary · Google Gemini · from 2 sources.