Single-stage zero-shot object detection network based on CLIP and pseudo-labeling

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

The detection of unknown objects is a challenging task in computer vision: real-world object categories are diverse, yet existing object-detection training sets cover only a limited number of them. Most existing approaches use two-stage networks to improve a model's ability to characterize objects of unknown classes, which leads to slow inference. To address this issue, we propose a single-stage unknown-object detection method based on the contrastive language-image pre-training (CLIP) model and pseudo-labeling, called CLIP-YOLO. First, a vision-language embedding alignment method is introduced, and a channel-grouped enhanced coordinate attention module is embedded into the YOLO-series detection head and feature-enhancement component, improving the model's ability to characterize and detect objects of unknown categories. Second, pseudo-label generation is optimized based on the CLIP model, expanding the diversity of the training set and improving coverage of unknown object categories. We validated this method on four challenging datasets: MSCOCO, ILSVRC, Visual Genome, and PASCAL VOC. The results show that our method achieves higher accuracy and faster inference, yielding better unknown-object detection performance. The source code is available at https://github.com/BJUTsipl/CLIP-YOLO.
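The abstract only sketches the method, and the released repository above is authoritative. As a rough, hedged illustration of the CLIP-based pseudo-labeling idea, the Python sketch below scores candidate region crops against CLIP text embeddings of class names and keeps high-confidence matches as pseudo-labels; the same image-text cosine similarity also underlies the vision-language embedding alignment used for zero-shot classification. The box source, prompt template, class list, and 0.9 confidence threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (NOT the authors' CLIP-YOLO implementation) of generating
# pseudo-labels with the public OpenAI CLIP API. The candidate boxes, prompt
# template, class list, and threshold below are illustrative assumptions.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text embeddings for the label space; unseen class names can be added here
# without retraining, which is what enables zero-shot scoring.
class_names = ["cat", "dog", "zebra"]  # hypothetical label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(prompts)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def pseudo_label(image: Image.Image, boxes, threshold: float = 0.9):
    """Score each candidate box crop against the class-name embeddings and
    keep high-confidence matches as (box, class, score) pseudo-labels."""
    labels = []
    for (x1, y1, x2, y2) in boxes:
        crop = preprocess(image.crop((x1, y1, x2, y2))).unsqueeze(0).to(device)
        with torch.no_grad():
            img_emb = model.encode_image(crop)
            img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        # Cosine similarity (embeddings are unit-normalized), scaled as in CLIP.
        probs = (100.0 * img_emb @ text_emb.T).softmax(dim=-1).squeeze(0)
        score, idx = probs.max(dim=0)
        if score.item() >= threshold:
            labels.append(((x1, y1, x2, y2), class_names[idx.item()], score.item()))
    return labels
```

In the paper's single-stage setting, the analogous similarity is computed inside the detection head rather than on crops, so one YOLO forward pass can score every prediction against the text embeddings; the crop-based version here is only the simplest way to show the scoring rule.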



Data availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62371015, 61971016, and 61701011; in part by the Beijing Natural Science Foundation under Grant L211017; and in part by the General Program of the Beijing Municipal Education Commission under Grant KM202110005027.

Author information


Contributions

Methodology, J.L. and L.Z.; Software, S.S. and K.Z.; Validation, K.Z., S.S., and J.L.; Writing—original draft, J.L. and K.Z.; Writing—review & editing, J.L., S.S., K.Z., and J.Z. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Jiafeng Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, J., Sun, S., Zhang, K. et al. Single-stage zero-shot object detection network based on CLIP and pseudo-labeling. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02321-1

