Brief · PulseAugur

TOOL · arXiv cs.LG · English (EN) · 1d

Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes

Pix2Fact is a new benchmark for evaluating vision-language models (VLMs) on tasks that require both fine-grained visual understanding and the integration of external knowledge. It features 1,000 high-resolution images and questions crafted by PhD-level experts, and it proved challenging for current state-of-the-art models: even an advanced VLM such as Gemini 3.1 Pro reached only 51.7% accuracy. The results highlight limitations in visual grounding, knowledge search, and retrieval of unstructured information. Pix2Fact aims to drive the development of next-generation AI agents that better combine perception with knowledge.

IMPACT: The Pix2Fact benchmark highlights current VLM weaknesses and pushes for agents that better integrate perception and knowledge retrieval.

GPT-5.4
Gemini 3.1 Pro
Pix2Fact
Cong Zhang