research-article

Zero-Shot Object Detection by Semantics-Aware DETR with Adaptive Contrastive Loss

Authors:

Shuigeng Zhou

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 4421 - 4430

https://doi.org/10.1145/3581783.3612523

Published: 27 October 2023

Abstract

Zero-shot object detection (ZSD) aims to localize and recognize objects of unseen categories in unconstrained images by leveraging semantic descriptions. Existing ZSD methods typically suffer from two drawbacks: 1) because no data on unseen categories is available during training, the model is inevitably biased towards the seen categories, i.e., it tends to assign objects of unseen categories to seen categories; 2) a feature extractor trained only on data of seen categories struggles to learn features discriminative enough to transfer the knowledge learned from seen categories to unseen categories. To tackle these problems, this paper proposes a novel zero-shot detection method based on a semantics-aware DETR and a class-wise adaptive contrastive loss. Concretely, to address the first problem, we develop a novel semantics-aware attention mechanism that mitigates the bias towards seen categories and integrate it into DETR, which results in a new end-to-end zero-shot object detection approach. To handle the second problem, we propose a novel class-wise adaptive contrastive loss, which considers the relevance between each pair of categories according to their semantic descriptions in order to learn separable features for better visual-semantic alignment. Extensive experiments and ablation studies on benchmark datasets demonstrate the effectiveness and superiority of the proposed method.
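The abstract does not give the formula of the class-wise adaptive contrastive loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes that each region feature is compared with per-class semantic embeddings by cosine similarity, and that each negative class is weighted by its semantic relevance to the ground-truth class. The function name `adaptive_contrastive_loss`, the weighting `1 + max(0, cos)`, and the temperature `tau` are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adaptive_contrastive_loss(features, labels, class_embeddings, tau=0.1):
    """Hypothetical relevance-weighted contrastive loss (illustration only).

    features:         list of visual feature vectors, one per object
    labels:           list of ground-truth class indices
    class_embeddings: list of semantic vectors, one per class
    tau:              softmax temperature (assumed value)
    """
    total = 0.0
    for feat, y in zip(features, labels):
        # Visual-semantic similarity of this feature to every class.
        sims = [cosine(feat, emb) for emb in class_embeddings]
        pos = math.exp(sims[y] / tau)
        neg = 0.0
        for c, emb in enumerate(class_embeddings):
            if c == y:
                continue
            # Assumption: classes that are semantically close to the true
            # class are the most confusable, so they get a larger weight.
            relevance = max(0.0, cosine(class_embeddings[y], emb))
            neg += (1.0 + relevance) * math.exp(sims[c] / tau)
        total += -math.log(pos / (pos + neg))
    return total / len(features)


if __name__ == "__main__":
    # Toy data: two classes with orthogonal semantic vectors.
    embeddings = [[1.0, 0.0], [0.0, 1.0]]
    feats = [[0.9, 0.1], [0.2, 0.8]]
    print(adaptive_contrastive_loss(feats, [0, 1], embeddings))
```

When all class embeddings are mutually orthogonal, every weight equals 1 and the sketch reduces to an ordinary softmax cross-entropy over cosine similarities; the pairwise weighting only changes the loss when some categories are semantically related.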

Index Terms

Zero-Shot Object Detection by Semantics-Aware DETR with Adaptive Contrastive Loss
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object detection
  2. Machine learning
    1. Learning paradigms
      1. Multi-task learning
        Transfer learning

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN: 9798400701085

DOI: 10.1145/3581783

General Chairs: Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE), Tao Mei (HiDream.ai, China), Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs: Marco Bertini (University of Florence, Italy), Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia), Pradeep K. Atrey (University at Albany, State University of New York, USA), M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China (NSFC)

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada
