Abstract
Siamese-network video trackers usually locate the target by computing a similarity score between the features of the target template and those of the predicted region. This approach performs poorly in complex scenarios because of problems such as target deformation and motion blur. To address these problems, this paper proposes a transformer-like feature fusion mechanism that fuses temporal information across consecutive video frames. To accommodate the two-branch structure of Siamese networks, the encoder and decoder are separated into two parallel branches: a transformer-like encoder enhances the features of the target template, while a transformer-like decoder fuses temporally related feature information. Experiments on the standard OTB100, VOT2018, UAV123, and NFS benchmarks show that the proposed network outperforms most mainstream trackers in the area.
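The two parallel branches described above can be sketched with plain attention operations: self-attention enhances the template features (the encoder branch), and cross-attention injects features accumulated from previous frames into the current search region (the decoder branch). The following is a minimal NumPy sketch of that idea only, not the authors' implementation; the function names, token counts, and feature dimension are illustrative assumptions, and real trackers would use multi-head attention, layer normalization, and learned projections.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def encode_template(template):
    # Encoder-like branch: self-attention enhances the template
    # features, with a residual connection.
    return template + attention(template, template, template)

def fuse_temporal(search, memory):
    # Decoder-like branch: cross-attention fuses temporal features
    # from previous frames (memory) into the current search-region
    # features, with a residual connection.
    return search + attention(search, memory, memory)

# Toy example: 4 template tokens, 6 search-region tokens, d = 8
rng = np.random.default_rng(0)
template = rng.standard_normal((4, 8))
search = rng.standard_normal((6, 8))

enhanced = encode_template(template)       # encoder branch output
fused = fuse_temporal(search, enhanced)    # decoder branch output
print(enhanced.shape, fused.shape)
```

Running the two branches in parallel rather than in the standard stacked encoder-decoder arrangement is what lets each branch consume one side of the Siamese pair: the template stream feeds the encoder-like branch and the search stream feeds the decoder-like branch.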
Acknowledgements
The authors are very grateful to the anonymous reviewers for their valuable comments on improving this article. This work was supported by the National Natural Science Foundation of China (Grant Nos. 62172350 and 62172349) and by the 2021 Academic Degree and Postgraduate Teaching Reform research project of Hunan Province (2021JGYB085).
About this article
Cite this article
Shi, Y., Wu, Z., Chen, Y. et al. Siamese tracker with temporal information based on transformer-like feature fusion mechanism. Machine Vision and Applications 34, 59 (2023). https://doi.org/10.1007/s00138-023-01409-y