VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA
Researchers have introduced VinQA, a new dataset designed to improve multimodal large language models (MLLMs) for question answering in real-world documents. Unlike previous models that often produce text-only responses, VinQA focuses on generating long-form answers that integrate cited visual elements like images and charts with supporting text. The study also explores two encoding methods for document page images and proposes M-GroSE, a multimodal evaluation framework to assess answer quality, including visual citation accuracy. AI
IMPACT Enhances multimodal LLMs' ability to process and generate answers incorporating visual elements from documents.