Abstract
Global attention learning has been extensively applied in video-based person re-identification due to its superiority in capturing contextual correlations. However, existing global attention learning methods usually adopt the conventional neural network to model non-Euclidean contextual correlations, resulting in a limited representation ability. Inspired by the graph-structure property of the contextual correlations, we propose a spatial-temporal graph-guided global attention network (STG\(^3\)A) for video-based person re-identification. STG\(^3\)A comprises two graph-guided attention modules to capture the spatial contexts within a frame and temporal contexts across all frames in a sequence for global attention learning. Furthermore, the graphs from both modules are encoded as graph representations, which combine with weighted representations to grasp the spatial-temporal contextual information adequately for video feature learning. To reduce the effect of noisy graph nodes and learn robust graph representations, a graph node attention is developed to trade-off the importance of each graph node, leading to noise-tolerant graph models. Finally, we design a graph-guided fusion scheme to integrate the representations output by these two attentive modules for a more compact video feature. Extensive experiments on MARS and DukeMTMCVideoReID datasets demonstrate the superior performance of the STG\(^3\)A.
Similar content being viewed by others
References
Karanam, S., Gou, M., Wu, Z., Rates-Borras, A., Camps, O.I., Radke, R.J.: A systematic evaluation and benchmark for person re-identification: features, metrics, and datasets. IEEE Trans. Pattern Anal. Mach. Intell. 41(3), 523–536 (2019)
Ye, M., Shen, J., Lin, G., Xiang, T., Shao, L., Hoi, S.C.: Deep learning for person re-identification: a survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 44(6), 2872–2893 (2021)
Sun, J., Li, Y., Chen, H., Peng, Y., Zhu, J.: Visible-infrared person re-identification model based on feature consistency and modal indistinguishability. Mach. Vis. Appl. 34(1), 14 (2023)
Perwaiz, N., Shahzad, M., Fraz, M.: Ubiquitous vision of transformers for person re-identification. Mach. Vis. Appl. 34(2), 27–40 (2023)
Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)
Dehghan, A., Modiri Assari, S., Shah, M.: Gmmcp tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE pp. 4091–4099 (2015)
Zhang, Z., Lan, C., Zeng, W., Jin, X., Chen, Z.: Relation-aware global attention for person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3186–3195 (2020). IEEE
Hou, R., Ma, B., Chang, H., Gu, X., Shan, S., Chen, X.: Iaunet: global context-aware feature learning for person reidentification. IEEE Trans. Neural Netw. Learn. Syst. 32(10), 4460–4474 (2020)
Wu, Y., Bourahla, O.E.F., Li, X., Wu, F., Tian, Q., Zhou, X.: Adaptive graph representation learning for video person re-identification. IEEE Trans. Image Process. 29, 8821–8830 (2020)
Hou, R., Chang, H., Ma, B., Shan, S., Chen, X.: Temporal complementary learning for video person re-identification. In: European Conference on Computer Vision, pp. 388–405 (2020). Springer
Zhang, Z., Lan, C., Zeng, W., Chen, Z.: Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10407–10416 (2020). IEEE
Wu, Y., Lin, Y., Dong, X., Yan, Y., Ouyang, W., Yang, Y.: Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5177–5186 (2018). IEEE
Li, S., Bak, S., Carr, P., Wang, X.: Diversity regularized spatiotemporal attention for video-based person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 369–378 (2018). IEEE
Zhang, R., Li, J., Sun, H., Ge, Y., Luo, P., Wang, X., Lin, L.: Scan: Self-and-collaborative attention network for video person re-identification. IEEE Trans. Image Process. 28(10), 4870–4882 (2019)
Chen, G., Lu, J., Yang, M., Zhou, J.: Spatial-temporal attention-aware learning for video-based person re-identification. IEEE Trans. Image Process. 28(9), 4192–4205 (2019)
Li, J., Zhang, S., Huang, T.: Multi-scale 3d convolution network for video based person re-identification. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8618–8625 (2019)
Chen, G., Lu, J., Yang, M., Zhou, J.: Learning recurrent 3d attention for video-based person re-identification. IEEE Trans. Image Process. 29, 6963–6976 (2020)
Yan, Y., Ni, B., Song, Z., Ma, C., Yan, Y., Yang, X.: Person re-identification via recurrent feature aggregation. In: European Conference on Computer Vision, pp. 701–716 (2016). Springer
Li, X., Zhou, W., Zhou, Y., Li, H.: Relation-guided spatial attention and temporal refinement for video-based person re-identification. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11434–11441 (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016). IEEE
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Philip, S.Y.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32(1), 4–24 (2020)
Nayak, R., Balabantaray, B.K., Patra, D.: A new single-image super-resolution using efficient feature fusion and patch similarity in non-Euclidean space. Arab. J. Sci. Eng. 45(12), 10261–10285 (2020)
Yu, J., Tan, M., Zhang, H., Rui, Y., Tao, D.: Hierarchical deep click feature prediction for fine-grained image recognition. IEEE Trans. Pattern Anal. Mach. Intell. 44(2), 563–578 (2019)
Gu, X., Chang, H., Ma, B., Zhang, H., Chen, X.: Appearance-preserving 3d convolution for video-based person re-identification. In: European Conference on Computer Vision, pp. 228–243 (2020). Springer
Chen, Y., Duffner, S., Stoian, A., Dufour, J.-Y., Baskurt, A.: List-wise learning-to-rank with convolutional neural networks for person re-identification. Mach. Vis. Appl. 32, 1–14 (2021). (Springer)
Ye, Z., Hong, C., Zeng, Z., Zhuang, W.: Self-supervised person re-identification with channel-wise transformer. In: 2022 IEEE International Conference on Big Data (Big Data), pp. 4210–4217 (2022). IEEE
Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015). IEEE
Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271 (2017). IEEE
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European Conference on Computer Vision, pp. 213–229 (2020). Springer
Hong, C., Yu, J., Wan, J., Tao, D., Wang, M.: Multimodal deep autoencoder for human pose recovery. IEEE Trans. Image Process. 24(12), 5659–5670 (2015)
Hong, C., Yu, J., Zhang, J., Jin, X., Lee, K.-H.: Multimodal face-pose estimation with multitask manifold deep learning. IEEE Trans. Industr. Inf. 15(7), 3952–3961 (2018)
Gu, X., Chang, H., Ma, B., Zhang, H., Chen, X.: Appearance-preserving 3d convolution for video-based person re-identification. In: European Conference on Computer Vision, pp. 228–243 (2020). Springer
Gao, C., Chen, Y., Yu, J.-G., Sang, N.: Pose-guided spatiotemporal alignment for video-based person re-identification. Inf. Sci. 527, 176–190 (2020)
Zhao, H., Tian, M., Sun, S., Shao, J., Yan, J., Yi, S., Wang, X., Tang, X.: Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1077–1085 (2017). IEEE
Fang, P., Zhou, J., Roy, S.K., Petersson, L., Harandi, M.: Bilinear attention networks for person retrieval. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 8030–8039 (2019). IEEE
Chen, B., Deng, W., Hu, J.: Mixed high-order attention network for person re-identification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 371–381 (2019). IEEE
Chen, T., Ding, S., Xie, J., Yuan, Y., Chen, W., Yang, Y., Ren, Z., Wang, Z.: Abd-net: Attentive but diverse person re-identification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 8351–8361 (2019). IEEE
Dong, H., Yang, Y., Sun, X., Zhang, L., Fang, L.: Cascaded attention-guided multi-granularity feature learning for person re-identification. Mach. Vis. Appl. 34(1), 1–16 (2023). (Springer)
Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018). IEEE
Chen, J., Lei, B., Song, Q., Ying, H., Chen, D.Z., Wu, J.: A hierarchical graph network for 3d object detection on point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 392–401 (2020). IEEE
Lin, Z.-H., Huang, S.-Y., Wang, Y.-C.F.: Convolution in the cloud: Learning deformable kernels in 3d graph convolution networks for point cloud analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1800–1809 (2020). IEEE
Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Semantic graph convolutional networks for 3d human pose regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3425–3435 (2019). IEEE
Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.-J., Yuan, J., Thalmann, N.M.: Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2272–2281 (2019). IEEE
Shen, Y., Li, H., Yi, S., Chen, D., Wang, X.: Person re-identification with deep similarity-guided graph neural network. In: Proceedings of the European Conference on Computer Vision, pp. 486–504 (2018). Springer
Yang, J., Zheng, W.-S., Yang, Q., Chen, Y.-C., Tian, Q.: Spatial-temporal graph convolutional network for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3289–3299 (2020). IEEE
Guo, S., Lin, Y., Feng, N., Song, C., Wan, H.: Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 922–929 (2019)
Guo, S., Lin, Y., Wan, H., Li, X., Cong, G.: Learning dynamics and heterogeneity of spatial-temporal graph data for traffic forecasting. IEEE Trans. Knowl. Data Eng. 34(11), 5415–5428 (2021). (IEEE)
Su, Y., Zhu, H., Tan, Y., An, S., Xing, M.: Prime: privacy-preserving video anomaly detection via motion exemplar guidance. Knowl.-Based Syst. 278, 110872 (2023)
Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)
Liu, J., Zha, Z.-J., Wu, W., Zheng, K., Sun, Q.: Spatial-temporal correlation and topology learning for person re-identification in videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4370–4379 (2021). IEEE
Wei, X., Yu, R., Sun, J.: View-gcn: View-based graph convolutional network for 3d shape analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1850–1859 (2020). IEEE
Yang, L., Zhan, X., Chen, D., Yan, J., Loy, C.C., Lin, D.: Learning to cluster faces on an affinity graph. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2298–2306 (2019). IEEE
Huang, H.-C., Chuang, Y.-Y., Chen, C.-S.: Affinity aggregation for spectral clustering. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 773–780 (2012). IEEE
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015). PMLR
Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011). JMLR Workshop and Conference Proceedings
Li, G., Muller, M., Thabet, A., Ghanem, B.: Deepgcns: Can gcns go as deep as cnns? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9267–9276 (2019). IEEE
Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S., Tian, Q.: Mars: A video benchmark for large-scale person re-identification. In: European Conference on Computer Vision, pp. 868–884 (2016). Springer
Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: European Conference on Computer Vision Workshops, pp. 17–35 (2016). Springer
Zhong, Z., Zheng, L., Cao, D., Li, S.: Re-ranking person re-identification with k-reciprocal encoding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1318–1327 (2017). IEEE
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014). IEEE
Jiang, X., Qiao, Y., Yan, J., Li, Q., Zheng, W., Chen, D.: SSN3D: Self-separated network to align parts for 3D convolution in video person re-identification. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1691–1699 (2021)
Liu, Y., Yan, J., Ouyang, W.: Quality aware network for set to set recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5790–5799 (2017). IEEE
Chen, D., Li, H., Xiao, T., Yi, S., Wang, X.: Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1169–1178 (2018). IEEE
Fu, Y., Wang, X., Wei, Y., Huang, T.: Sta: Spatial-temporal attention for large-scale video-based person re-identification. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8287–8294 (2019)
Li, J., Wang, J., Tian, Q., Gao, W., Zhang, S.: Global-local temporal representations for video person re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3958–3967 (2019). IEEE
Subramaniam, A., Nambiar, A., Mittal, A.: Co-segmentation inspired attention networks for video-based person re-identification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 562–572 (2019). IEEE
Hou, R., Ma, B., Chang, H., Gu, X., Shan, S., Chen, X.: Vrstc: Occlusion-free video person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7183–7192 (2019). IEEE
Hou, R., Chang, H., Ma, B., Huang, R., Shan, S.: Bicnet-tks: Learning efficient spatial-temporal representation for video person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2014–2023 (2021)
Chen, C., Ye, M., Qi, M., Wu, J., Liu, Y., Jiang, J.: Saliency and granularity: discovering temporal coherence for video-based person re-identification. IEEE Trans. Circuits Syst. Video Technol. 32(9), 6100–6112 (2022)
Pan, H., Chen, Y., He, Z.: Multi-granularity graph pooling for video-based person re-identification. Neural Netw. 160, 22–33 (2023)
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant 62276019 and 62006017.
Author information
Authors and Affiliations
Contributions
The proposed method was designed by XL, QL and WW. Material preparation, data collection and experimentation were performed by XL, and the first draft of the manuscript was written by XL. The analysis of experimental results was performed by XL, QL, WW and JZ. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, X., Wang, W., Li, Q. et al. Spatial-temporal graph-guided global attention network for video-based person re-identification. Machine Vision and Applications 35, 8 (2024). https://doi.org/10.1007/s00138-023-01489-w
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s00138-023-01489-w