
Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection

Published in: Multimedia Tools and Applications

Abstract

The self-attention-based vision transformer has powerful feature extraction capabilities and has demonstrated competitive performance on several tasks. However, because the conventional self-attention mechanism exhibits global perceptual properties and favors large-scale objects, room for improvement remains in detection performance at other scales. To address this issue, we propose the dynamic gate-assisted network (DGANet), a novel yet simple framework that enhances the multiscale generalization capability of the vision transformer architecture. First, we design a dynamic multi-headed self-attention mechanism (DMH-SAM), which dynamically selects self-attention components and applies a local-to-global self-attention pattern, enabling the model to learn features of objects at different scales autonomously while reducing computational cost. Then, we propose a dynamic multiscale encoder (DMEncoder), which weights and encodes feature maps with different receptive fields to adaptively narrow the network's performance gap across object scales. Extensive ablation and comparison experiments demonstrate the effectiveness of the proposed method: its detection accuracy for small, medium and large objects reaches 27.6, 47.4 and 58.5, respectively, surpassing state-of-the-art detection methods, while its model complexity is reduced by 23%.
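To make the two ideas in the abstract concrete, the sketch below illustrates (i) a multi-head self-attention whose heads are softly selected by a data-dependent gate and (ii) a multiscale encoder that weights branches with different receptive fields before fusing them. This is a minimal illustration under our own assumptions: the gating network (mean pooling plus a small MLP), the head counts, and the depthwise-convolution branches with kernel sizes 3/5/7 are placeholders, not the authors' exact DMH-SAM or DMEncoder design.

```python
# Minimal PyTorch sketch of gated head selection and multiscale weighted fusion.
# All design details (gate structure, pooling, kernel sizes) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMultiHeadSelfAttention(nn.Module):
    """Self-attention whose heads are softly selected by a per-sample gate,
    so the model can emphasise local or global heads depending on the input."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Gate: pool tokens, then predict one logit per head (illustrative choice).
        self.gate = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                  nn.Linear(dim // 4, num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v                       # (B, heads, N, head_dim)
        head_gates = torch.sigmoid(self.gate(x.mean(dim=1))) # (B, heads)
        out = out * head_gates[:, :, None, None]             # suppress or boost heads per sample
        return self.proj(out.transpose(1, 2).reshape(B, N, C))


class MultiScaleWeightedEncoder(nn.Module):
    """Fuses feature maps with different receptive fields (depthwise convolutions
    of several kernel sizes) using learned, input-dependent weights."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
             for k in kernel_sizes])
        self.score = nn.Linear(channels, len(kernel_sizes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W). One softmax weight per receptive-field branch.
        weights = F.softmax(self.score(x.mean(dim=(2, 3))), dim=-1)   # (B, branches)
        feats = torch.stack([b(x) for b in self.branches], dim=1)     # (B, branches, C, H, W)
        return (feats * weights[:, :, None, None, None]).sum(dim=1) + x


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 256)                       # (batch, tokens, dim)
    print(GatedMultiHeadSelfAttention(256)(tokens).shape)   # torch.Size([2, 196, 256])
    fmap = torch.randn(2, 256, 14, 14)                      # (batch, channels, H, W)
    print(MultiScaleWeightedEncoder(256)(fmap).shape)       # torch.Size([2, 256, 14, 14])
```

In this sketch the gates are continuous (sigmoid/softmax) rather than hard selections; a hard, sparsity-inducing gate would reduce computation further but requires straight-through or Gumbel-style training tricks.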


Data availability

The datasets used in this study are available at https://image-net.org/download.php (ImageNet) and https://cocodataset.org/#download (COCO), respectively.


Acknowledgements

This work was supported by the Scientific and Technological Innovation Plan of Shanghai STC, No. 21511102605 and Wuxi Municipal Health Commission Translational Medicine Research Project, No. ZH202102.

Funding

This work was funded by the Scientific and Technological Innovation Plan of Shanghai STC (No. 21511102605) and the Wuxi Municipal Health Commission Translational Medicine Research Project (No. ZH202102), both awarded to Xiaofeng Lu.

Author information


Corresponding author

Correspondence to Xiaofeng Lu.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest directly or indirectly related to the work submitted for publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Fang, S., Lu, X., Huang, Y. et al. Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection. Multimed Tools Appl 83, 67213–67229 (2024). https://doi.org/10.1007/s11042-024-18234-8

