Brief · PulseAugur

TOOL · arXiv cs.CV English(EN) · 8h

VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

Researchers have developed VL-DINO, an object detection model that integrates knowledge from CLIP, a vision-language model. The model uses new modules to construct better training samples and to fuse visual and textual information. In zero-shot tests on the LVIS benchmark, VL-DINO achieved state-of-the-art results, outperforming previous methods.

IMPACT Sets a new state of the art for zero-shot object detection on the LVIS benchmark, potentially improving image analysis capabilities.

LVIS benchmark
VL-DINO