Abstract
The detection of unknown objects is a challenging task in computer vision because, although there are diverse real-world detection object categories, existing object-detection training sets cover a limited number of object categories . Most existing approaches use two-stage networks to improve a model’s ability to characterize objects of unknown classes, which leads to slow inference. To address this issue, we proposed a single-stage unknown object detection method based on the contrastive language-image pre-training (CLIP) model and pseudo-labelling, called CLIP-YOLO. First, a visual language embedding alignment method is introduced and a channel-grouped enhanced coordinate attention module is embedded into a YOLO-series detection head and feature-enhancing component, to improve the model’s ability to characterize and detect unknown category objects. Second, the pseudo-labelling generation is optimized based on the CLIP model to expand the diversity of the training set and enhance the ability to cover unknown object categories. We validated this method on four challenging datasets: MSCOCO, ILSVRC, Visual Genome, and PASCAL VOC. The results show that our method can achieve higher accuracy and faster speed, so as to obtain better performance of unknown object detection. The source code is available at https://github.com/BJUTsipl/CLIP-YOLO.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Data availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
References
Zou Z, Chen K, Shi Z, Guo Y, Ye J (2023) Object detection in 20 years: a survey. Proceedings of the IEEE
Liang W, Xue F, Liu Y, Zhong G, Ming A (2023) Unknown sniffer for object detection: Don’t turn a blind eye to unknown objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3230–3239
Bansal A, Sikka K, Sharma G, Chellappa R, Divakaran A (2018) Zero-shot object detection. In: Proceedings of the European conference on computer vision (ECCV), pp 384–400
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning, pp 8748–8763
Zareian A, Rosa KD, Hu DH, Chang S-F (2021) Open-vocabulary object detection using captions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14393–14402
Zhao S, Zhang Z, Schulter S, Zhao L, Vijay Kumar B, Stathopoulos A, Chandraker M, Metaxas DN (2022) Exploiting unlabeled data with vision and language models for object detection. In: European conference on computer vision, pp 159–175. Springer
Xie Q, Luong M-T, Hovy E, Le QV (2020) Self-training with noisy student improves imagenet classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10687–10698
Tang Y, Chen W, Luo Y, Zhang Y (2021) Humble teachers teach better students for semi-supervised object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3132–3141
Rahman S, Khan S, Barnes N (2020) Improved visual-semantic alignment for zero-shot object detection. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 11932–11939
Zhao S, Gao C, Shao Y, Li L, Yu C, Ji Z, Sang N (2020) Gtnet: Generative transfer network for zero-shot object detection. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 12967–12974
Zheng Y, Wu J, Qin Y, Zhang F, Cui L (2021) Zero-shot instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2593–2602
Zhang L, Zhang C, Zhao J, Guan J, Zhou S (2023) Meta-zsdetr: Zero-shot detr with meta-learning. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 6845–6854
Liu H, Zhang L, Guan J, Zhou S (2023) Zero-shot object detection by semantics-aware detr with adaptive contrastive loss. In: Proceedings of the 31st ACM international conference on multimedia, pp 4421–4430
He S, Ding H, Jiang W (2023) Semantic-promoted debiasing and background disambiguation for zero-shot instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19498–19507
Khandelwal S, Nambirajan A, Siddiquie B, Eledath J, Sigal L (2023) Frustratingly simple but effective zero-shot detection and segmentation: analysis and a strong baseline. ArXiv arxiv:2302.07319
He S, Ding H, Jiang W (2023) Primitive generation and semantic-related alignment for universal zero-shot segmentation. In 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 11238–11247
Huang P, Zhang D, Cheng D, Han L, Zhu P, Han J (2024) M-RRFs: A memory-based robust region feature synthesizer for zero-shot object detection. Int J Comput Vis. https://doi.org/10.1007/s11263-024-02112-9
Gu X, Lin T-Y, Kuo W, Cui Y (2021) Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921
Zhong Y, Yang J, Zhang P, Li C, Codella N, Li LH, Zhou L, Dai X, Yuan L, Li Y, et al. (2022) Regionclip: Region-based language-image pretraining. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16793–16803
Feng C, Zhong Y, Jie Z, Chu X, Ren H, Wei X, Xie W, Ma L (2022) Promptdet: Towards open-vocabulary detection using uncurated images. In: European conference on computer vision, pp 701–717. Springer
Zang Y, Li W, Zhou K, Huang C, Loy CC (2022) Open-vocabulary detr with conditional matching. In: European conference on computer vision, pp 106–122. Springer
Kim D, Angelova A, Kuo W (2023) Region-aware pretraining for open-vocabulary object detection with vision transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11144–11154
Wu S, Zhang W, Jin S, Liu W, Loy CC (2023) Aligning bag of regions for open-vocabulary object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15254–15264
Wu X, Zhu F, Zhao R, Li H (2023) Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 7031–7040:
Cheng T, Song L, Ge Y, Liu W, Wang X, Shan Y (2024) Yolo-world: Real-time open-vocabulary object detection. ArXiv arxiv:2401.17270
Rosenberg C, Hebert M, Schneiderman H (2005) Semi-supervised self-training of object detection models
Jeong J, Lee S, Kim J, Kwak N (2019) Consistency-based semi-supervised learning for object detection. Adv Neural Inform Process Syst 32
Tang P, Ramaiah C, Wang Y, Xu R, Xiong C (2021) Proposal learning for semi-supervised object detection. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2291–2301
Radosavovic I, Dollár P, Girshick R, Gkioxari G, He K (2018) Data distillation: Towards omni-supervised learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4119–4128
Zoph B, Ghiasi G, Lin T-Y, Cui Y, Liu H, Cubuk ED, Le Q (2020) Rethinking pre-training and self-training. Adv Neural Inf Process Syst 33:3833–3845
Li Y, Huang D, Qin D, Wang L, Gong B (2020) Improving object detection with selective self-supervised self-training. In: European conference on computer vision, pp 589–607. Springer
Sohn K, Zhang Z, Li C-L, Zhang H, Lee C-Y, Pfister T (2020) A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757
Wang K, Yan X, Zhang D, Zhang L, Lin L (2018) Towards human-machine cooperation: Self-supervised sample mining for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1605–1613
Liu Y-C, Ma C-Y, He Z, Kuo C-W, Chen K, Zhang P, Wu B, Kira Z, Vajda P (2021) Unbiased teacher for semi-supervised object detection. arXiv preprint arXiv:2102.09480
Zang Y, Zhou K, Huang C, Loy CC (2023) Semi-supervised and long-tailed object detection with cascadematch. Int J Comput Vis 131(4):987–1001
Liu C, Zhang W, Lin X, Zhang W, Tan X, Han J, Li X, Ding E, Wang J (2023) Ambiguity-resistant semi-supervised learning for dense object detection. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 15579–15588
Hou Q, Zhou D, Feng J (2021) Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13713–13722
Liu H, Liu F, Fan X, Huang D (2021) Polarized self-attention: Towards high-quality pixel-wise regression. arXiv preprint arXiv:2107.00782
Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recognition. Int J Comput Vis 104:154–171
Zhang P, Li X, Hu X, Yang J, Zhang L, Wang L, Choi Y, Gao J (2021) Vinvl: Revisiting visual representations in vision-language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5579–5588
Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. Adv Neural Inform Process Syst 28
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13, pp 740–755. Springer
Kuznetsova A, Rom H, Alldrin N, Uijlings J, Krasin I, Pont-Tuset J, Kamali S, Popov S, Malloci M, Kolesnikov A et al (2020) The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. Int J Comput Vis 128(7):1956–1981
Gupta A, Dollar P, Girshick R (2019) Lvis: A dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5356–5364
Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis 88:303–338
Zheng Y, Huang R, Han C, Huang X, Cui L (2020) Background learnable cascade for zero-shot object detection. In: Proceedings of the Asian conference on computer vision
Xie J, Zheng S (2022) Zero-shot object detection through vision-language embedding alignment. In: 2022 IEEE international conference on data mining workshops (ICDMW), pp 1–15. IEEE
Huang P, Han J, Cheng D, Zhang D (2022) Robust region feature synthesizer for zero-shot object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7622–7631
Rahman S, Khan S, Porikli F (2018) Zero-shot object detection: learning to simultaneously recognize and localize novel concepts. In: Asian conference on computer vision, pp 547–563. Springer
Li Z, Yao L, Zhang X, Wang X, Kanhere S, Zhang H (2019) Zero-shot object detection with textual descriptions. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 8690–8697
Demirel B, Cinbis RG, Ikizler-Cinbis N (2018) Zero-shot object detection by hybrid region embedding. arXiv preprint arXiv:1805.06157
Hayat N, Hayat M, Rahman S, Khan S, Zamir SW, Khan FS (2020) Synthesizing the unseen for zero-shot object detection. In: Proceedings of the Asian Conference on computer vision
Xian Y, Sharma S, Schiele B, Akata Z (2019) f-vaegan-d2: A feature generating framework for any-shot learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10275–10284
Jocher G, Chaurasia A, Stoken A, Borovec J, Kwon Y, Michael K, Fang J, Wong C, Yifu Z, Montes D, et al (2022) ultralytics/yolov5: v6. 2-yolov5 classification models, apple m1, reproducibility, clearml and deci
Wang C-Y, Bochkovskiy A, Liao H-YM (2023) Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7464–7475
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Howard A, Sandler M, Chu G, Chen L-C, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V et al.: (2019) Searching for mobilenetv3. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1314–1324
Redmon J, Farhadi A (2018) Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 62371015, and in part by Beijing Natural Science Foundation under Grant L211017, and in part by the General Program of Beijing Municipal Education Commission under Grant KM202110005027, and in part by National Natural Science Foundation of China under Grant 61971016 and 61701011.
Author information
Authors and Affiliations
Contributions
Methodology, J.L. and L.Z.; Software, S.S.,K.Z.; Validation, Z.K. ,S.S. and J.L.; Writing—original draft, J.L. K.Z; Writing—review & editing, J.L., S.S.,K.Z. and J.Z. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, J., Sun, S., Zhang, K. et al. Single-stage zero-shot object detection network based on CLIP and pseudo-labeling. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02321-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s13042-024-02321-1