VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detectio
Researchers have developed VL-DINO, a new object detection model that effectively integrates knowledge from CLIP, a vision-language model. The model uses novel modules to construct better training samples and fuse visual and textual information. In zero-shot tests on the LVIS benchmark, VL-DINO achieved state-of-the-art results, outperforming previous methods. AI
IMPACT Sets new SOTA on zero-shot object detection benchmarks, potentially improving image analysis capabilities.