Abstract
The current mainstream Siamese network cannot maximally discriminate between the target and the background because it cannot fully utilize the features extracted by a feature network. Here we propose a novel tracker called SiamMLT, which employs a convolutional neural network (CNN) as the backbone and transformer for multi-layer feature fusion. To fully exploit both the low-level and high-level features of the CNN network, transformer performs feature fusion to enhance the feature information and information-expression capabilities. Our tracker replaces the subsequent correlation operations with cross-attention, which generates a fusion vector of the template and search regions for subsequent classification and regression operation. Our method largely enhances the network’s ability to discriminate between objects and backgrounds and to perceive objects. The superior performance of our method is demonstrated in extensive experiments and evaluations on five challenging datasets. Finally, SiamMLT tracker achieves compelling performance compared with the state-of-the-art trackers.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Data Availability
The datasets used in the study are available from the corresponding authors according to rea-sonable request.
References
Li P, Wang D, Wang L, Lu H (2018) Deep visual tracking: review and experimental comparison. Pattern Recognit 76:323–338
Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PH (2016) Fully-convolutional Siamese networks for object tracking. In: European conference on computer vision. Springer, pp 850–865
Li B, Yan J, Wu W, Zhu Z, Hu X (2018) High performance visual tracking with Siamese region proposal network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8971–8980
Bhat G, Danelljan M, Gool LV, Timofte R (2019) Learning discriminative model prediction for tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6182–6191
Danelljan M, Gool LV, Timofte R (2020) Probabilistic regression for visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7183–7192
Dai K, Zhang Y, Wang D, Li J, Lu H, Yang X (2020) High-performance long-term tracking with meta-updater. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6298–6307
Chen X, Yan B, Zhu J, Wang D, Yang X, Lu H (2021) Transformer tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8126–8135
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Redmon J, Farhadi A (2018) Yolov3: an incremental improvement, arXiv preprint arXiv:1804.02767
Hong C, Yu J, Zhang J, Jin X, Lee K-H (2018) Multimodal face-pose estimation with multitask manifold deep learning. IEEE Trans Ind Inform 15(7):3952–3961
Hong C, Yu J, Wan J, Tao D, Wang M (2015) Multimodal deep autoencoder for human pose recovery. IEEE Trans Image Process 24(12):5659–5670
Yu J, Rui Y, Chen B (2013) Exploiting click constraints and multi-view features for image re-ranking. IEEE Trans Multimed 16(1):159–168
Yu J, Rui Y, Tao D (2014) Click prediction for web image reranking using multimodal sparse coding. IEEE Trans Image Process 23(5):2019–2032
Zhang J, Yang J, Yu J, Fan J (2022) Semisupervised image classification by mutual learning of multiple self-supervised models. Int J Intell Syst 37(5):3117–3141
Yu J, Tan M, Zhang H, Rui Y, Tao D (2019) Hierarchical deep click feature prediction for fine-grained image recognition. IEEE Trans Pattern Anal Mach Intell 44(2):563–578
Zhang J, Cao Y, Wu Q (2021) Vector of locally and adaptively aggregated descriptors for image feature representation. Pattern Recognit 116:107952
Danelljan M, Robinson A, Shahbaz Khan F, Felsberg F (2016) Beyond correlation filters: learning continuous convolution operators for visual tracking. In: European conference on computer vision. Springer, pp 472–488
Henriques JF, Caseiro R, Martins P, Batista J (2014) High-speed tracking with kernelized correlation filters. IEEE Trans Pattern Anal Mach Intell 37(3):583–596
Zhang T, Ghanem B, Liu S et al (2012) Robust visual tracking via multi-task sparse learning. In: 2012 IEEE conference on computer vision and pattern recognition, Providence, RI, USA
Xu L, Wei Y, Dong C, Xu C, Diao Z (2021) Wasserstein distance-based auto-encoder tracking. Neural Process Lett 53(3):2305–2329
Wu Y, Cai C, Yeo CK (2022) Siamese centerness prediction network for real-time visual object tracking. Neural Process Lett 66:1–16
Li B, Yan J, Wu W, Zhu Z, Hu X (2018) High performance visual tracking with Siamese region proposal network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8971–8980
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adva Neural Inf Process Syst 25:66
Xu Y, Wang Z, Li Z, Yuan Y, Yu G (2020) Siamfc++: towards robust and accurate visual tracking with target estimation guidelines. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 12549–12556
Guo D, Wang J, Cui Y, Wang Z, Chen S Siamcar: Siamese fully convolutional classification and regression for visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6269–6277
Chen Z, Zhong B, Li G, Zhang S, Ji R (2020) Siamese box adaptive network for visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6668–6677
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30:66
Tenney I, Das D, Pavlick E (2019) Bert rediscovers the classical nlp pipeline, arXiv preprint arXiv:1905.05950
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A et al (2020) Language models are few-shot learners. Adv Neural Inf Process Syst 33:1877–1901
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale, arXiv preprint arXiv:2010.11929
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision. Springer, pp 213–229
Yuan L, Chen Y, Wang T, Yu W, Shi Y, Jiang Z-H, Tay FE, Feng J, Yan S (2021) Tokens-to-token vit: training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 558–567
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
Wang N, Zhou W, Wang J, Li H (2021) Transformer meets tracker: exploiting temporal context for robust visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1571–1580
Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 658–666
Wu Y, Lim J, Yang M-H (2015) Object tracking benchmark. IEEE Trans Pattern Analy Mach Intell 9(37):1834–1848
Kristan M, Leonardis A, Matas J, Felsberg M, Pflugfelder R, Cehovin Zajc L, Vojir T, Bhat G, Lukezic A, Eldesokey A et al (2018) The sixth visual object tracking vot2018 challenge results. In: Proceedings of the European conference on computer vision (ECCV) workshops
Mueller M, Smith N, Ghanem B (2016) A benchmark and simulator for UAV tracking. In: European conference on computer vision. Springer, pp 445–461
Fan H, Lin L, Yang F, Chu P, Deng G, Yu S, Bai H, Xu Y, Liao C, Ling H (2019) Lasot: a high-quality benchmark for large-scale single object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5374–5383
Kristan M, Matas J, Leonardis A, Felsberg M, Pflugfelder R, Kämäräinen J-K, Chang HJ, Danelljan M, Cehovin L, Lukežič A et al (2021) The ninth visual object tracking vot2021 challenge results. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 2711–2738
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, pp 740–755
Muller M, Bibi A, Giancola S, Alsubaihi S, Ghanem B (2018) Trackingnet: a large-scale dataset and benchmark for object tracking in the wild. In: Proceedings of the European conference on computer vision (ECCV), pp 300–317
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
Loshchilov I, Hutter F (2017) Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101
Danelljan M, Bhat G, Shahbaz Khan F, Felsberg M (2017) Eco: efficient convolution operators for tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6638–6646
Bhat G, Danelljan M, Gool LV, Timofte R (2019) Learning discriminative model prediction for tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6182–6191
Danelljan R, Gool R, Timofte R (2020) Probabilistic regression for visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7183–7192
Nam R, Han B (2016) Learning multi-domain convolutional neural networks for visual tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4293–4302
Zhang Z, Peng H (2019) Deeper and wider siamese networks for real-time visual tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4591–4600
Zhang Z, Peng H, Fu J, Li B, Hu W (2020) Ocean: object-aware anchor-free tracking. In: European conference on computer vision. Springer, pp 771–787
Zhu Z, Wang Q, Li B, Wu W, Yan J, Hu W (2018) Distractor-aware Siamese networks for visual object tracking. In: Proceedings of the European conference on computer vision (ECCV), pp 101–117
Danelljan M, Bhat G, Khan FS, Felsberg M (2019) Atom: accurate tracking by overlap maximization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4660–4669
Li P, Chen B, Ouyang W, Wang D, Yang X, Lu H (2019) Gradnet: gradient-guided network for visual object tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6162–6171
Danelljan M, Hager G, Shahbaz Khan F, Felsberg M (2015) Convolutional features for correlation filter based visual tracking. In: Proceedings of the IEEE international conference on computer vision workshops, pp 58–66
Danelljan M, Hager G, Shahbaz Khan F, Felsberg M (2015) Learning spatially regularized correlation filters for visual tracking. In: Proceedings of the IEEE international conference on computer vision, pp 4310–4318
Valmadre J, Bertinetto L, Henriques J, Vedaldi A, Torr PH (2017) End-to-end representation learning for correlation filter based tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2805–2813
Bertinetto L, Valmadre J, Golodetz S, Miksik O, Torr PH (2016) Staple: complementary learners for real-time tracking. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1401–1409
Lukezic A, Matas J, Kristan M (2020) D3s-a discriminative single shot segmentation tracker. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7133–7142
Wang G, Luo C, Sun X, Xiong Z, Zeng Z (2020) Tracking by instance detection: a meta-learning approach. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6288–6297
Yu Y, Xiong Y, Huang W, Scott MR (2020) Deformable Siamese attention networks for visual object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6728–6737
Bhat G, Danelljan M, Gool LV, Timofte R (2020) Know your surroundings: exploiting scene information for object tracking. In: European conference on computer vision. Springer, pp 205–221
Voigtlaender P, Luiten J, Torr PH, Leibe B (2020) Siam r-cnn: visual tracking by re-detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6578–6588
Guo D, Shao Y, Cui Y, Wang Z, Zhang L, Shen C (2021) Graph attention tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9543–9552
Zhang Z, Liu Y, Wang X, Li X, Hu W (2021) Learn to match: Automatic matching network design for visual tracking. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 13339–13348
Funding
This research was funded by The Zhejiang University-Shandong (Linyi) Modern Agricultural Research Institute Service Local Economic Development Project (Open Project) (Grant No. ZDNY-2021-FWLY02016), the Natural Science Foundation of Shandong Province (Grant No. ZR2019MA030), and National Natural Science Foundation of China (NSFC) (Grant No. 61402212).
Author information
Authors and Affiliations
Contributions
ZW: Methodology, Writing—Reviewing and Editing; HC: Conceptualization, Methodology, Software; LY: Formal analysis; YR: Writing—Original Draft; HT: Writing—Review & Editing, Funding acquisition; XW: Funding acquisition.
Corresponding author
Ethics declarations
Conflict of interest
None.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, Z., Chen, H., Yuan, L. et al. SiamMLT: Siamese Hybrid Multi-layer Transformer Fusion Tracker. Neural Process Lett 55, 9651–9667 (2023). https://doi.org/10.1007/s11063-023-11219-y
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11063-023-11219-y