Skeleton Cluster Tracking for robust multi-view multi-person 3D human pose estimation

Published: 25 September 2024

Abstract

Multi-view 3D human pose estimation relies on 2D human pose estimation in each view; however, severe occlusion, truncation, and human interaction cause incorrect 2D pose estimates in some views. The traditional “Matching-Lifting-Tracking” paradigm amplifies these incorrect 2D poses into incorrect 3D poses, which significantly challenges the robustness of multi-view 3D human pose estimation. In this paper, we propose a novel method that tackles the inherent difficulties of the traditional paradigm. The method is rooted in a newly devised “Skeleton Pooling-Clustering-Tracking (SPCT)” paradigm. It first estimates 2D human poses in each view; a symmetrical dilated network then estimates a skeleton pool. After clustering the skeleton pool, we introduce a tracking method designed specifically for the SPCT paradigm, which refines and filters the skeleton clusters and thereby enhances the robustness of the multi-person 3D pose estimates. By coupling the skeleton pool with the tracking refinement, our method obtains high-quality multi-person 3D human pose estimates even under severe occlusions that produce erroneous 2D and 3D estimates. With the proposed SPCT paradigm and a computationally efficient network architecture, our method outperforms existing approaches in robustness on the Shelf, 4D Association, and CMU Panoptic datasets, and can be applied in practical scenarios such as markerless motion capture and animation production.
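The three SPCT stages described in the abstract — pooling per-view skeleton candidates, clustering them into people, and tracking clusters over time to filter occlusion-induced ghosts — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: skeletons are reduced to single 3D root points, the symmetrical dilated network is omitted, and `build_skeleton_pool`, `cluster_pool`, and `track` are hypothetical names with simple greedy distance-based logic standing in for the paper's actual clustering and tracking.

```python
from math import dist

def build_skeleton_pool(per_view_candidates):
    """Pool 3D skeleton candidates lifted from every camera view."""
    pool = []
    for view in per_view_candidates:
        pool.extend(view)
    return pool

def cluster_pool(pool, radius=0.3):
    """Greedy distance-based clustering: candidates within `radius`
    metres of a running cluster centre are treated as the same person."""
    clusters = []
    for p in pool:
        for c in clusters:
            centre = tuple(sum(x) / len(c) for x in zip(*c))
            if dist(p, centre) <= radius:
                c.append(p)
                break
        else:
            clusters.append([p])
    # Return one centre (mean of members) per cluster.
    return [tuple(sum(x) / len(c) for x in zip(*c)) for c in clusters]

def track(prev_tracks, centres, max_jump=0.5):
    """Match this frame's cluster centres to last frame's tracks by
    nearest neighbour; implausible jumps are rejected, and unmatched
    centres start new tracks."""
    tracks, used = {}, set()
    for tid, pos in prev_tracks.items():
        best = min((c for c in centres if c not in used),
                   key=lambda c: dist(pos, c), default=None)
        if best is not None and dist(pos, best) <= max_jump:
            tracks[tid] = best
            used.add(best)
    next_id = max(prev_tracks, default=-1) + 1
    for c in centres:
        if c not in used:
            tracks[next_id] = c
            next_id += 1
    return tracks
```

Under these assumptions, two views seeing the same person at roughly the same 3D location collapse into one cluster, while a candidate far from any cluster centre starts its own, and the tracker then carries cluster identities across frames.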

Highlights

A novel “Skeleton Pooling-Clustering-Tracking (SPCT)” paradigm for 3D HPE.
Superior robustness against occlusions for challenging scenarios.
Real-time performance with low computational complexity for practical applications.
Demonstrated effectiveness on challenging datasets.




            Published In

            cover image Computer Vision and Image Understanding
            Computer Vision and Image Understanding  Volume 246, Issue C
            Sep 2024
            137 pages

            Publisher

            Elsevier Science Inc.

            United States


Author Tags

1. 41A05
2. 41A10
3. 65D05
4. 65D17
5. 3D human pose estimation
6. Motion capture
7. Deep learning

            Qualifiers

            • Research-article
