FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs
Researchers have introduced FindIt, a new benchmark designed to evaluate the promptable localization abilities of generalist multimodal large language models (MLLMs). This benchmark covers object detection, referring expression detection, instance-level detection, and video-based detection, standardizing inputs and outputs for fair evaluation. Initial assessments of various MLLMs reveal significant limitations, particularly in adhering to specific output formatting requirements, highlighting areas for future model development and evaluation improvements. AI
IMPACT Establishes a new standard for evaluating MLLMs in localization tasks, potentially guiding future model development towards better adherence to structured outputs.