Brief · PulseAugur

TOOL · arXiv cs.AI · 8h

VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

Researchers have introduced VinQA, a new dataset designed to improve multimodal large language models (MLLMs) at question answering over real-world documents. Whereas prior work often produces text-only responses, VinQA targets long-form answers that interleave cited visual elements, such as images and charts, with supporting text. The study also explores two methods for encoding document page images and proposes M-GroSE, a multimodal evaluation framework that assesses answer quality, including visual citation accuracy.

IMPACT Enhances multimodal LLMs' ability to process documents and generate answers that incorporate their visual elements.

arXiv
MLLMs
Qwen2.5-VL
Page Encoding
Modality Encoding
M-GroSE
Visual G-Eval