Abstract
The self-attention-based vision transformer has powerful feature extraction capabilities and has demonstrated competitive performance in several tasks. However, while the conventional self-attention mechanism exhibits global perceptual properties that favor large-scale objects, its object-detection performance at other scales still leaves room for improvement. To address this issue, we propose the dynamic gate-assisted network (DGANet), a novel yet simple framework that enhances the multiscale generalization capability of the vision transformer structure. First, we design a dynamic multi-headed self-attention mechanism (DMH-SAM), which dynamically selects the self-attention components and applies a local-to-global self-attention pattern, enabling the model to learn features of objects at different scales autonomously while reducing the computational cost. Then, we propose a dynamic multiscale encoder (DMEncoder), which weights and encodes feature maps with different receptive fields so that the network self-adapts to its performance gap across object scales. Extensive ablation and comparison experiments demonstrate the effectiveness of the proposed method: its detection accuracy for small, medium and large targets reaches 27.6, 47.4 and 58.5 respectively, surpassing even the most advanced object detection methods, while its model complexity is reduced by 23%.
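The core idea behind DMH-SAM, input-dependent selection of self-attention components, can be illustrated with a small gating sketch. The following PyTorch snippet is a minimal illustration only, assuming a per-head sigmoid gate driven by pooled tokens; the class name, gate design, and tensor shapes are our assumptions for exposition, not the authors' released DMH-SAM implementation.

```python
# Minimal sketch of dynamically gated multi-head self-attention (PyTorch).
# Illustrative only: the gate, head count, and shapes are assumptions,
# not the paper's exact DMH-SAM design.
import torch
import torch.nn as nn


class GatedMultiHeadSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Lightweight gate: pooled tokens -> one weight per head in [0, 1].
        self.gate = nn.Sequential(nn.Linear(dim, num_heads), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v              # (b, heads, n, head_dim)
        # Input-dependent gate down-weights heads that contribute little
        # for the current input, emulating dynamic component selection.
        g = self.gate(x.mean(dim=1))                # (b, heads)
        out = out * g[:, :, None, None]
        return self.proj(out.transpose(1, 2).reshape(b, n, d))


if __name__ == "__main__":
    layer = GatedMultiHeadSelfAttention(dim=64, num_heads=8)
    tokens = torch.randn(2, 196, 64)                # e.g. 14x14 patch tokens
    print(layer(tokens).shape)                      # torch.Size([2, 196, 64])
```

The same gating principle plausibly extends to the DMEncoder: instead of weighting attention heads, a learned gate would weight feature maps with different receptive fields before they are fused, again as an input-dependent combination rather than a fixed one.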
Data availability
The datasets used in this study are available at https://image-net.org/download.php (ImageNet) and https://cocodataset.org/#download (COCO), respectively.
Acknowledgements
This work was supported by the Scientific and Technological Innovation Plan of Shanghai STC (No. 21511102605) and the Wuxi Municipal Health Commission Translational Medicine Research Project (No. ZH202102).
Funding
This work was funded by the Scientific and Technological Innovation Plan of Shanghai STC (No. 21511102605) and the Wuxi Municipal Health Commission Translational Medicine Research Project (No. ZH202102), both awarded to Xiaofeng Lu.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest that are directly or indirectly related to this publication.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Fang, S., Lu, X., Huang, Y. et al. Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection. Multimed Tools Appl 83, 67213–67229 (2024). https://doi.org/10.1007/s11042-024-18234-8