
SiamMLT: Siamese Hybrid Multi-layer Transformer Fusion Tracker

Neural Processing Letters

Abstract

Mainstream Siamese trackers cannot maximally discriminate the target from the background because they do not fully exploit the features extracted by the backbone network. Here we propose a novel tracker, SiamMLT, which employs a convolutional neural network (CNN) as the backbone and a transformer for multi-layer feature fusion. To fully exploit both the low-level and high-level features of the CNN, the transformer fuses features across layers, enriching the feature information and its expressive capacity. Our tracker replaces the conventional correlation operation with cross-attention, which generates a fusion vector of the template and search regions for the subsequent classification and regression heads. This design substantially improves the network's ability to discriminate objects from the background and to perceive the target. Extensive experiments and evaluations on five challenging datasets demonstrate that SiamMLT achieves compelling performance compared with state-of-the-art trackers.
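
To make the cross-attention fusion concrete, here is a minimal PyTorch sketch of the idea: flattened search-region tokens attend to template tokens, producing the fused representation that would feed the classification and regression heads. The module name, dimensions, and residual/normalization details are illustrative assumptions for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse template and search-region features with multi-head cross-attention.

    Search tokens act as queries; template tokens act as keys and values, so
    every search location attends to the whole template.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
        # search:   (B, N_s, C) flattened search-region feature tokens
        # template: (B, N_t, C) flattened template feature tokens
        fused, _ = self.attn(query=search, key=template, value=template)
        # Residual connection plus layer norm, as in a standard transformer block.
        return self.norm(search + fused)


# Toy usage: fuse an 8x8 template feature map with a 16x16 search feature map.
fusion = CrossAttentionFusion(dim=256)
template = torch.randn(2, 8 * 8, 256)
search = torch.randn(2, 16 * 16, 256)
out = fusion(search, template)  # (2, 256, 256): one fused token per search location
print(out.shape)
```

Replacing correlation with such an attention step lets every search location weigh all template locations adaptively, rather than through a fixed sliding-window similarity.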




Data Availability

The datasets used in this study are available from the corresponding author upon reasonable request.


Funding

This research was funded by the Zhejiang University-Shandong (Linyi) Modern Agricultural Research Institute Service Local Economic Development Project (Open Project) (Grant No. ZDNY-2021-FWLY02016), the Natural Science Foundation of Shandong Province (Grant No. ZR2019MA030), and the National Natural Science Foundation of China (NSFC) (Grant No. 61402212).

Author information

Contributions

ZW: Methodology, Writing—Reviewing and Editing; HC: Conceptualization, Methodology, Software; LY: Formal analysis; YR: Writing—Original Draft; HT: Writing—Review & Editing, Funding acquisition; XW: Funding acquisition.

Corresponding author

Correspondence to Hui Chen.

Ethics declarations

Conflict of interest

None.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Z., Chen, H., Yuan, L. et al. SiamMLT: Siamese Hybrid Multi-layer Transformer Fusion Tracker. Neural Process Lett 55, 9651–9667 (2023). https://doi.org/10.1007/s11063-023-11219-y

