SiamMLT: Siamese Hybrid Multi-layer Transformer Fusion Tracker

Zhenhai Wang¹,
Hui Chen¹,
Lutao Yuan¹,
Ying Ren¹,
Hongyu Tian² &
…
Xing Wang¹

Abstract

Current mainstream Siamese networks cannot maximally discriminate between the target and the background because they do not fully utilize the features extracted by the feature network. Here we propose a novel tracker called SiamMLT, which employs a convolutional neural network (CNN) as the backbone and a transformer for multi-layer feature fusion. To fully exploit both the low-level and high-level features of the CNN, the transformer performs feature fusion to enhance the feature information and its expressive capability. Our tracker replaces the correlation operations with cross-attention, which generates a fusion vector of the template and search regions for the subsequent classification and regression operations. Our method largely enhances the network's ability to discriminate between objects and backgrounds and to perceive objects. The superior performance of our method is demonstrated in extensive experiments and evaluations on five challenging datasets, on which SiamMLT achieves compelling performance compared with state-of-the-art trackers.
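To make the idea of replacing correlation with cross-attention concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes single-head scaled dot-product attention with identity projections (no learned weights), search-region tokens as queries, template tokens as keys and values, and a residual sum as the fusion step. The names `cross_attention`, `search_tokens`, and `template_tokens` are hypothetical and do not come from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(search_tokens, template_tokens):
    """Single-head cross-attention with identity projections (an assumption).

    search_tokens:   list of N feature vectors, used as queries
    template_tokens: list of M feature vectors, used as keys and values
    Returns (fused, weights): fused is N vectors, weights is an N x M matrix.
    """
    d = len(template_tokens[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in search_tokens:
        # Similarity of this search token to every template token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in template_tokens]
        w = softmax(scores)
        # Weighted sum of template tokens (values).
        attended = [
            sum(w[j] * template_tokens[j][c] for j in range(len(template_tokens)))
            for c in range(d)
        ]
        # Residual fusion of the search token with the attended template features.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
        weights.append(w)
    return fused, weights


# Toy features: two template tokens and three search tokens of dimension 2.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
fused, weights = cross_attention(search, template)
```

In this toy setup each search token attends most strongly to the template token it resembles, and the fused vectors would then feed the classification and regression heads. A real tracker would add learned query, key, and value projections, multiple heads, and positional encodings.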

Data Availability

The datasets used in the study are available from the corresponding authors upon reasonable request.

Funding

This research was funded by the Zhejiang University-Shandong (Linyi) Modern Agricultural Research Institute Service Local Economic Development Project (Open Project) (Grant No. ZDNY-2021-FWLY02016), the Natural Science Foundation of Shandong Province (Grant No. ZR2019MA030), and the National Natural Science Foundation of China (NSFC) (Grant No. 61402212).

Author information

Authors and Affiliations

College of Information Science and Engineering, Linyi University, Linyi, 276000, Shandong, China
Zhenhai Wang, Hui Chen, Lutao Yuan, Ying Ren & Xing Wang
School of Physics and Electronic Engineering, Linyi University, Linyi, 276000, Shandong, China
Hongyu Tian

Contributions

ZW: Methodology, Writing—Reviewing and Editing; HC: Conceptualization, Methodology, Software; LY: Formal analysis; YR: Writing—Original Draft; HT: Writing—Review & Editing, Funding acquisition; XW: Funding acquisition.

Corresponding author

Correspondence to Hui Chen.

Ethics declarations

Conflict of interest

None.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Z., Chen, H., Yuan, L. et al. SiamMLT: Siamese Hybrid Multi-layer Transformer Fusion Tracker. Neural Process Lett 55, 9651–9667 (2023). https://doi.org/10.1007/s11063-023-11219-y

Accepted: 26 February 2023
Published: 11 March 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s11063-023-11219-y
