
Siamese tracker with temporal information based on transformer-like feature fusion mechanism

  • Original Paper
  • Published: 2023
Machine Vision and Applications

Abstract

In video trackers based on the Siamese network, the target position is usually obtained by computing a similarity score between the features of the target template and those of the candidate search region. This approach performs poorly in complex scenarios involving problems such as target deformation and motion blur. To address these problems, this paper proposes a transformer-like feature fusion mechanism that fuses temporal information from consecutive video frames. To accommodate the structure of Siamese networks, the encoder and decoder are separated into two parallel branches: a transformer-like encoder enhances the features of the target template, while a transformer-like decoder fuses temporally related feature information. Experiments on the standard OTB100, VOT2018, UAV123, and NFS benchmarks show that the proposed network outperforms most mainstream algorithms in the field.
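As a concrete illustration of the fusion scheme the abstract describes, the sketch below shows the two parallel branches in minimal PyTorch form: a transformer-like encoder that enhances template features with self-attention, and a transformer-like decoder that fuses templates from consecutive frames into the current search-region features via cross-attention. Module names, feature dimensions, and the residual/normalization layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TemplateEncoder(nn.Module):
    """Encoder branch (sketch): self-attention enhances template features."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(),
                                 nn.Linear(dim * 4, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, z):
        # z: (L_z, B, dim) -- flattened template feature map
        z = self.norm1(z + self.attn(z, z, z)[0])
        return self.norm2(z + self.ffn(z))


class TemporalDecoder(nn.Module):
    """Decoder branch (sketch): cross-attention fuses encoded templates from
    consecutive frames into the current search-region features."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads)
        self.cross_attn = nn.MultiheadAttention(dim, heads)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(),
                                 nn.Linear(dim * 4, dim))
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x, z_mem):
        # x: (L_x, B, dim) search features; z_mem: (L_m, B, dim) templates
        # from several past frames concatenated along the token axis
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        x = self.norm2(x + self.cross_attn(x, z_mem, z_mem)[0])
        return self.norm3(x + self.ffn(x))


if __name__ == "__main__":
    enc, dec = TemplateEncoder(), TemporalDecoder()
    # two past template maps (8x8 tokens each) and one search map (16x16)
    z1, z2 = torch.randn(64, 1, 256), torch.randn(64, 1, 256)
    x = torch.randn(256, 1, 256)
    z_mem = torch.cat([enc(z1), enc(z2)], dim=0)  # temporal template memory
    fused = dec(x, z_mem)                         # (256, 1, 256) fused features
    print(fused.shape)
```

In a tracker organized this way, encoded templates from several consecutive frames would form a memory that the decoder attends to, so the similarity-based prediction operates on search features already conditioned on the target's recent appearance.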




Acknowledgements

The authors are very grateful to the anonymous reviewers for their valuable comments on improving this article. This work was supported by the National Natural Science Foundation of China (Grant Nos. 62172350 and 62172349) and by the 2021 Academic Degree and Postgraduate Teaching Reform research project of Hunan Province (2021JGYB085).

Author information

Corresponding author

Correspondence to Ziping Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Shi, Y., Wu, Z., Chen, Y. et al. Siamese tracker with temporal information based on transformer-like feature fusion mechanism. Machine Vision and Applications 34, 59 (2023). https://doi.org/10.1007/s00138-023-01409-y


