Abstract
The self-attention-based vision transformer has powerful feature extraction capabilities and has demonstrated competitive performance in several tasks. However, while the conventional self-attention mechanism exhibits global perceptual properties that favor large-scale objects, its object-detection performance at other scales still leaves room for improvement. To address this issue, we propose the dynamic gate-assisted network (DGANet), a novel yet simple framework that enhances the multiscale generalization capability of the vision transformer structure. First, we design a dynamic multi-headed self-attention mechanism (DMH-SAM), which dynamically selects the self-attention components and applies a local-to-global self-attention pattern, enabling the model to learn features of objects at different scales autonomously while reducing the computational cost. Then, we propose a dynamic multiscale encoder (DMEncoder), which weights and encodes feature maps with different receptive fields so that the network self-adapts to its performance gap across object scales. Extensive ablation and comparison experiments demonstrate the effectiveness of the proposed method: its detection accuracy for small, medium and large targets reaches 27.6, 47.4 and 58.5 respectively, surpassing even the most advanced object detection methods, while its model complexity is reduced by 23%.
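The core idea behind DMH-SAM, input-dependent selection of self-attention components, can be illustrated with a small gating sketch. The following PyTorch snippet is a minimal illustration only, assuming a per-head sigmoid gate driven by pooled tokens; the class name, gate design, and tensor shapes are our assumptions for exposition, not the authors' released DMH-SAM implementation.

```python
# Minimal sketch of dynamically gated multi-head self-attention (PyTorch).
# Illustrative only: the gate, head count, and shapes are assumptions,
# not the paper's exact DMH-SAM design.
import torch
import torch.nn as nn


class GatedMultiHeadSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Lightweight gate: pooled tokens -> one weight per head in [0, 1].
        self.gate = nn.Sequential(nn.Linear(dim, num_heads), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v              # (b, heads, n, head_dim)
        # Input-dependent gate down-weights heads that contribute little
        # for the current input, emulating dynamic component selection.
        g = self.gate(x.mean(dim=1))                # (b, heads)
        out = out * g[:, :, None, None]
        return self.proj(out.transpose(1, 2).reshape(b, n, d))


if __name__ == "__main__":
    layer = GatedMultiHeadSelfAttention(dim=64, num_heads=8)
    tokens = torch.randn(2, 196, 64)                # e.g. 14x14 patch tokens
    print(layer(tokens).shape)                      # torch.Size([2, 196, 64])
```

The same gating principle plausibly extends to the DMEncoder: instead of weighting attention heads, a learned gate would weight feature maps with different receptive fields before they are fused, again as an input-dependent combination rather than a fixed one.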
Data availability
The datasets used in this study are available at https://image-net.org/download.php (ImageNet) and https://cocodataset.org/#download (COCO), respectively.
Acknowledgements
This work was supported by the Scientific and Technological Innovation Plan of Shanghai STC (No. 21511102605) and the Wuxi Municipal Health Commission Translational Medicine Research Project (No. ZH202102).
Funding
This work was funded by the Scientific and Technological Innovation Plan of Shanghai STC (No. 21511102605) and the Wuxi Municipal Health Commission Translational Medicine Research Project (No. ZH202102), both awarded to Xiaofeng Lu.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest that are directly or indirectly related to this publication.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fang, S., Lu, X., Huang, Y. et al. Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection. Multimed Tools Appl 83, 67213–67229 (2024). https://doi.org/10.1007/s11042-024-18234-8