Two-Stream Transformer for Multi-Label Image Classification

Authors:

Bo Liu

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

Pages 3598 - 3607

https://doi.org/10.1145/3503161.3548343

Published: 10 October 2022

Abstract

Multi-label image classification is a fundamental yet challenging task in computer vision that aims to identify multiple objects in a given image. Recent studies on this task mainly focus on learning cross-modal interactions between label semantics and high-level visual representations via an attention operation. However, these one-shot attention-based approaches generally perform poorly in establishing accurate and robust alignments between vision and text due to the acknowledged semantic gap. In this paper, we propose a two-stream transformer (TSFormer) learning framework, in which the spatial stream focuses on extracting patch features with global perception, while the semantic stream aims to learn vision-aware label semantics as well as their correlations via a multi-shot attention mechanism. Specifically, in each layer of TSFormer, a cross-modal attention module is developed to aggregate visual features from the spatial stream into the semantic stream and update label semantics via a residual connection. In this way, the semantic gap between the two streams gradually narrows as the procedure progresses layer by layer, allowing the semantic stream to produce sophisticated visual representations for each label towards accurate label recognition. Extensive experiments on three visual benchmarks, Pascal VOC 2007, Microsoft COCO, and NUS-WIDE, consistently demonstrate that our proposed TSFormer achieves state-of-the-art performance on the multi-label image classification task.
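
The layer-wise update described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (softmax, cross_modal_attention, semantic_stream) are hypothetical, the learned query/key/value projections of a real transformer are replaced by identity maps, and the per-layer patch features that the spatial stream would produce are simply passed in as lists. It only shows the idea of label embeddings attending over patch features and being updated through a residual connection, repeated once per layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(labels, patches):
    """Update each label embedding with attention-weighted patch features.

    labels:  list of C label vectors (queries), each of length d
    patches: list of N patch vectors (keys and values), each of length d
    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    d = len(labels[0])
    updated = []
    for q in labels:
        # Scaled dot-product score between this label and every patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in patches]
        weights = softmax(scores)
        # Aggregate visual features for this label.
        context = [sum(w * p[j] for w, p in zip(weights, patches))
                   for j in range(d)]
        # Residual connection: keep the label semantics, add visual evidence.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def semantic_stream(labels, patches_per_layer):
    """Apply cross-modal attention once per layer (multi-shot attention)."""
    for patches in patches_per_layer:
        labels = cross_modal_attention(labels, patches)
    return labels


# Toy usage: two labels, two patches, two layers sharing the same patches.
toy_labels = [[1.0, 0.0], [0.0, 1.0]]
toy_patches = [[2.0, 0.0], [0.0, 2.0]]
vision_aware_labels = semantic_stream(toy_labels, [toy_patches, toy_patches])
```

In this toy setting each label ends up weighted toward the patch it is most similar to, and the repeated application across layers is what distinguishes the multi-shot scheme from a single one-shot attention step.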

Supplementary Material

MP4 File (MM22-fp2671.mp4)

Presentation video - short version


Index Terms

Two-Stream Transformer for Multi-Label Image Classification
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object recognition

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

October 2022

7537 pages

ISBN:9781450392037

DOI:10.1145/3503161

General Chairs:
João Magalhães, NOVA University of Lisbon, Portugal
Alberto del Bimbo, University of Florence, Italy
Shin'ichi Satoh, National Institute of Informatics, Japan
Nicu Sebe, University of Trento, Italy

Program Chairs:
Xavier Alameda-Pineda, Inria, Grenoble, France
Qin Jin, Renmin University of China, China
Vincent Oria, New Jersey Institute of Technology, USA
Laura Toni, University College London, UK

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Purple Mountain Laboratories
Jiangsu Provincial Key Laboratory of Network and Information Security
Natural Science Foundation of Jiangsu Province
Jiangsu Provincial Key Laboratory of Computer Networking Technology
National Key R&D Project of China
Key Laboratory of Computer Network and Information Integration of Ministry of Education of China
National Natural Science Foundation of China

Conference

MM '22

Sponsor:

SIGMM

MM '22: The 30th ACM International Conference on Multimedia

October 10 - 14, 2022

Lisboa, Portugal

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
