Abstract
Siamese-network video trackers usually locate the target by computing a similarity score between the features of the target template and those of the predicted region. This approach performs poorly in complex scenarios because of problems such as target deformation and motion blur. To address these problems, this paper proposes a transformer-like feature fusion mechanism that fuses temporal information across consecutive video frames. To accommodate the two-branch structure of Siamese networks, the encoder and decoder are separated into two parallel branches: a transformer-like encoder enhances the features of the target template, while a transformer-like decoder fuses temporally related feature information. Experiments on the standard OTB100, VOT2018, UAV123, and NFS benchmarks show that the proposed network outperforms most mainstream trackers in the area.
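The two parallel branches described above can be sketched with plain attention operations: self-attention enhances the template features (the encoder branch), and cross-attention injects features accumulated from previous frames into the current search region (the decoder branch). The following is a minimal NumPy sketch of that idea only, not the authors' implementation; the function names, token counts, and feature dimension are illustrative assumptions, and real trackers would use multi-head attention, layer normalization, and learned projections.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def encode_template(template):
    # Encoder-like branch: self-attention enhances the template
    # features, with a residual connection.
    return template + attention(template, template, template)

def fuse_temporal(search, memory):
    # Decoder-like branch: cross-attention fuses temporal features
    # from previous frames (memory) into the current search-region
    # features, with a residual connection.
    return search + attention(search, memory, memory)

# Toy example: 4 template tokens, 6 search-region tokens, d = 8
rng = np.random.default_rng(0)
template = rng.standard_normal((4, 8))
search = rng.standard_normal((6, 8))

enhanced = encode_template(template)       # encoder branch output
fused = fuse_temporal(search, enhanced)    # decoder branch output
print(enhanced.shape, fused.shape)
```

Running the two branches in parallel rather than in the standard stacked encoder-decoder arrangement is what lets each branch consume one side of the Siamese pair: the template stream feeds the encoder-like branch and the search stream feeds the decoder-like branch.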
Acknowledgements
The authors are very grateful to the anonymous reviewers for their valuable comments on improving this article. This work was supported by the National Natural Science Foundation of China (Grant Nos. 62172350 and 62172349) and by the 2021 Academic Degree and Postgraduate Teaching Reform research project of Hunan Province (2021JGYB085).
About this article
Cite this article
Shi, Y., Wu, Z., Chen, Y. et al. Siamese tracker with temporal information based on transformer-like feature fusion mechanism. Machine Vision and Applications 34, 59 (2023). https://doi.org/10.1007/s00138-023-01409-y