Skeleton Cluster Tracking for robust multi-view multi-person 3D human pose estimation

Published: 25 September 2024

Abstract

Multi-view 3D human pose estimation relies on 2D human pose estimation in each view; however, severe occlusion, truncation, and human interaction cause incorrect 2D pose estimates in some views. The traditional “Matching-Lifting-Tracking” paradigm amplifies these incorrect 2D poses into incorrect 3D poses, which significantly challenges the robustness of multi-view 3D human pose estimation. In this paper, we propose a novel method that tackles the inherent difficulties of the traditional paradigm. The method is rooted in a newly devised “Skeleton Pooling-Clustering-Tracking (SPCT)” paradigm. It first estimates 2D human poses in each view; a symmetrical dilated network then estimates a skeleton pool. After clustering the skeleton pool, we introduce a tracking method designed specifically for the SPCT paradigm, which refines and filters the skeleton clusters and thereby enhances the robustness of the multi-person 3D pose estimates. By coupling the skeleton pool with the tracking refinement, our method obtains high-quality multi-person 3D human pose estimates even under severe occlusions that produce erroneous 2D and 3D estimates. With the proposed SPCT paradigm and a computationally efficient network architecture, our method outperforms existing approaches in robustness on the Shelf, 4D Association, and CMU Panoptic datasets, and can be applied in practical scenarios such as markerless motion capture and animation production.
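The three SPCT stages described in the abstract — pooling per-view skeleton candidates, clustering them into people, and tracking clusters over time to filter occlusion-induced ghosts — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: skeletons are reduced to single 3D root points, the symmetrical dilated network is omitted, and `build_skeleton_pool`, `cluster_pool`, and `track` are hypothetical names with simple greedy distance-based logic standing in for the paper's actual clustering and tracking.

```python
from math import dist

def build_skeleton_pool(per_view_candidates):
    """Pool 3D skeleton candidates lifted from every camera view."""
    pool = []
    for view in per_view_candidates:
        pool.extend(view)
    return pool

def cluster_pool(pool, radius=0.3):
    """Greedy distance-based clustering: candidates within `radius`
    metres of a running cluster centre are treated as the same person."""
    clusters = []
    for p in pool:
        for c in clusters:
            centre = tuple(sum(x) / len(c) for x in zip(*c))
            if dist(p, centre) <= radius:
                c.append(p)
                break
        else:
            clusters.append([p])
    # Return one centre (mean of members) per cluster.
    return [tuple(sum(x) / len(c) for x in zip(*c)) for c in clusters]

def track(prev_tracks, centres, max_jump=0.5):
    """Match this frame's cluster centres to last frame's tracks by
    nearest neighbour; implausible jumps are rejected, and unmatched
    centres start new tracks."""
    tracks, used = {}, set()
    for tid, pos in prev_tracks.items():
        best = min((c for c in centres if c not in used),
                   key=lambda c: dist(pos, c), default=None)
        if best is not None and dist(pos, best) <= max_jump:
            tracks[tid] = best
            used.add(best)
    next_id = max(prev_tracks, default=-1) + 1
    for c in centres:
        if c not in used:
            tracks[next_id] = c
            next_id += 1
    return tracks
```

Under these assumptions, two views seeing the same person at roughly the same 3D location collapse into one cluster, while a candidate far from any cluster centre starts its own, and the tracker then carries cluster identities across frames.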

Highlights

A novel “Skeleton Pooling-Clustering-Tracking (SPCT)” paradigm for 3D HPE.
Superior robustness against occlusions for challenging scenarios.
Real-time performance with low computational complexity for practical applications.
Demonstrated effectiveness on challenging datasets.




            Published In

            cover image Computer Vision and Image Understanding
            Computer Vision and Image Understanding  Volume 246, Issue C
            Sep 2024
            137 pages

            Publisher

            Elsevier Science Inc.

            United States


Author Tags

1. 41A05
2. 41A10
3. 65D05
4. 65D17
5. 3D human pose estimation
6. Motion capture
7. Deep learning

            Qualifiers

            • Research-article
