DOI: 10.1145/3581783.3612523
research-article

Zero-Shot Object Detection by Semantics-Aware DETR with Adaptive Contrastive Loss

Published: 27 October 2023

Abstract

Zero-shot object detection (ZSD) aims to localize and recognize unseen objects in unconstrained images by leveraging semantic descriptions. Existing ZSD methods typically suffer from two drawbacks: 1) Because no data on unseen categories is available during training, the model is inevitably biased towards the seen categories, i.e., it tends to classify objects of unseen categories as seen categories; 2) A feature extractor trained only on data of seen categories struggles to learn features discriminative enough to transfer the knowledge learned from seen categories to unseen categories. To tackle these problems, this paper proposes a novel zero-shot detection method based on a semantics-aware DETR and a class-wise adaptive contrastive loss. Concretely, to address the first problem, we develop a novel semantics-aware attention mechanism that mitigates the bias towards seen categories and integrate it into DETR, which results in a new end-to-end zero-shot object detection approach. To handle the second problem, we propose a novel class-wise adaptive contrastive loss, which considers the relevance between each pair of categories according to their semantic descriptions in order to learn separable features for better visual-semantic alignment. Extensive experiments and ablation studies on benchmark datasets demonstrate the effectiveness and superiority of the proposed method.
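The abstract does not give the formula for the class-wise adaptive contrastive loss, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes a supervised contrastive objective in which each negative pair is re-weighted by the cosine similarity between the semantic embeddings (e.g. word vectors) of the two classes involved, so that semantically close categories are pushed apart more strongly. The function names, the `1 + max(0, cos)` weighting, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_pair_weights(class_embeddings):
    """Hypothetical weighting: 1 on the diagonal, and
    1 + max(0, cos(s_c, s_k)) for two different classes c and k,
    so semantically similar classes receive a larger weight."""
    n = len(class_embeddings)
    return [
        [
            1.0 if c == k
            else 1.0 + max(0.0, cosine(class_embeddings[c], class_embeddings[k]))
            for k in range(n)
        ]
        for c in range(n)
    ]


def adaptive_contrastive_loss(features, labels, class_embeddings, temperature=0.1):
    """Supervised contrastive loss with class-pair weights (illustrative only).

    features: list of feature vectors, one per object
    labels: list of integer class indices, one per object
    class_embeddings: list of semantic vectors, one per class
    """
    weights = class_pair_weights(class_embeddings)
    n = len(features)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner contributes nothing
        # Weighted, temperature-scaled similarity of anchor i to every other sample.
        terms = {
            j: weights[labels[i]][labels[j]]
            * math.exp(cosine(features[i], features[j]) / temperature)
            for j in range(n)
            if j != i
        }
        denominator = sum(terms.values())
        for p in positives:
            total += -math.log(terms[p] / denominator)
            count += 1
    return total / count if count else 0.0


# Toy usage: two classes, two objects each.
labels = [0, 0, 1, 1]
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
unrelated_classes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
related_classes = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0]]

loss_separated = adaptive_contrastive_loss(separated, labels, unrelated_classes)
loss_mixed = adaptive_contrastive_loss(mixed, labels, unrelated_classes)
loss_related = adaptive_contrastive_loss(separated, labels, related_classes)
```

In this toy setup the loss is small when same-class features cluster together and large when they are interleaved, and for the same features it grows when the two class embeddings are semantically closer. That is the qualitative behaviour a class-pair-aware weighting is meant to produce under the assumptions stated above.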



Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. contrastive learning
  2. DETR
  3. object detection
  4. zero-shot learning

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China (NSFC)

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months): 268
  • Downloads (Last 6 weeks): 31
Reflects downloads up to 16 Nov 2024

Cited By

  • (2024) Fractional Correspondence Framework in Detection Transformer. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 5498-5506. DOI: 10.1145/3664647.3681613. Online publication date: 28-Oct-2024
  • (2024) Uni-YOLO: Vision-Language Model-Guided YOLO for Robust and Fast Universal Detection in the Open World. Proceedings of the 32nd ACM International Conference on Multimedia, pp. 1991-2000. DOI: 10.1145/3664647.3681212. Online publication date: 28-Oct-2024
  • (2024) Accelerated Data Engine: A faster dataset construction workflow for computer vision applications in commercial livestock farms. Computers and Electronics in Agriculture, 226 (109452). DOI: 10.1016/j.compag.2024.109452. Online publication date: Nov-2024
  • (2024) Single-stage zero-shot object detection network based on CLIP and pseudo-labeling. International Journal of Machine Learning and Cybernetics. DOI: 10.1007/s13042-024-02321-1. Online publication date: 20-Aug-2024
  • (2024) M-RRFS: A Memory-Based Robust Region Feature Synthesizer for Zero-Shot Object Detection. International Journal of Computer Vision, 132:10, pp. 4651-4672. DOI: 10.1007/s11263-024-02112-9. Online publication date: 22-May-2024
