SDFPoseGraphNet: Spatial Deep Feature Pose Graph Network for 2D Hand Pose Estimation
Figure 1. Illustration of the SDFPoseGraphNet architectural design.
Figure 2. Architecture of VGG-19 with the SA module for enhanced 2D HPE.
Figure 3. Comprehensive overview of the First and Second Inference Modules (FIM and SIM).
Figure 4. Illustrative representation of message passing within a hand tree structure.
Figure 5. PCK evaluation for performance comparison of the proposed model against existing models [12,16,33]. (a) Test; (b) Validation.
Figure 6. Visualizing the performance of SDFPoseGraphNet: random image analysis.
Figure 7. Illustrative comparison of 2D HPE: (a) Ground truth; (b) Ours; (c) CDGCN [16]; (d) AGMN [33].
Figure 8. (a) PCK comparison of the First Inference Module (FIM) with and without the integration of the SA module; (b) PCK comparison of FIM with preprocessed data (PD) and original data.
Figure 9. (a) Original image; (b) Preprocessed image.
Abstract
1. Introduction
- We introduce SDFPoseGraphNet, a novel framework that enhances VGG-19 with spatial attention (SA) mechanisms. The SA module directs the network toward spatially informative regions of the hand image, enabling VGG-19 to extract deep feature maps that encode intricate relationships among the hand joints.
- To address the challenge of accurate pose estimation, we incorporate a PGM into SDFPoseGraphNet. The PGM uses adaptively learned SIM parameters, derived from the deep feature maps extracted by VGG-19, to encode the geometric constraints among hand joints. Because these parameters adapt to each input, the model delivers personalized pose estimation tailored to the characteristics of every individual image.
- The model combines FIM potentials and SIM parameters, which play a crucial role in the final pose estimation performed by the PGM. The inference process employs message passing over the hand graph to refine and enhance the accuracy of the joint predictions.
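The SA-gated feature extraction described in the first contribution can be sketched as follows. This is a minimal CBAM-style spatial-attention sketch under our own assumptions, not necessarily the paper's exact SA design: the learned fusion convolution is replaced by a fixed average of the two pooled maps, and the feature-map size is hypothetical.

```python
import numpy as np

def spatial_attention(feats):
    """Spatial-attention sketch: pool the feature map over channels,
    build a per-location gate in (0, 1), and reweight every channel by
    that gate so spatially informative regions are emphasized."""
    avg_pool = feats.mean(axis=0)         # (H, W) channel-average map
    max_pool = feats.max(axis=0)          # (H, W) channel-max map
    # Stand-in for a learned conv that would fuse the two pooled maps:
    logits = 0.5 * (avg_pool + max_pool)
    gate = 1.0 / (1.0 + np.exp(-logits))  # sigmoid gate, one weight per pixel
    return feats * gate[None, :, :]       # broadcast the gate across channels

# Hypothetical VGG-19 stage output: 512 channels on a 46x46 grid.
feats = np.random.default_rng(0).standard_normal((512, 46, 46))
out = spatial_attention(feats)
assert out.shape == feats.shape
```

Because the gate lies strictly between 0 and 1, the module can only attenuate features per location; in the full model this reweighted map is what the downstream inference modules consume.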
2. Related Works
3. SDFPoseGraphNet
3.1. Optimized VGG-19 Backbone with SA Module Integration for Improved 2D HPE Feature Extraction
3.2. Architectural and Operational Insights into the FIM
3.3. Architectural and Operational Insights into the SIM
3.4. Final Graph Inference Module
4. Experimental Setup
4.1. Dataset
4.2. Implementation Details
4.3. Loss Function
4.4. Model Optimization
4.5. Activation Functions
4.6. Evaluation Metric
5. Results and Analysis
5.1. Quantitative Results
5.2. Qualitative Results
5.3. Ablation Study
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| SDFPoseGraphNet | Spatial Deep Feature Pose Graph Network |
| PGM | Pose Graph Model |
| FIM | First Inference Module |
| SIM | Second Inference Module |
| VGG | Visual Geometry Group |
| CPM | Convolutional Pose Machine |
| CNN | Convolutional Neural Network |
| DCNN | Deep Convolutional Neural Network |
| HPE | Hand Pose Estimation |
| SA | Spatial Attention |
| PD | Preprocessed Data |
| AGMN | Adaptive Graphical Model Network |
| CDGCN | Cascaded Deep Graph Convolutional Neural Network |
| CMU | Carnegie Mellon University |
References
1. Chen, W.; Yu, C.; Tu, C.; Lyu, Z.; Tang, J.; Ou, S.; Fu, Y.; Xue, Z. A Survey on Hand Pose Estimation with Wearable Sensors and Computer-Vision-Based Methods. Sensors 2020, 20, 1074.
2. Santavas, N.; Kansizoglou, I.; Bampis, L.; Karakasis, E.; Gasteratos, A. Attention! A Lightweight 2D Hand Pose Estimation Approach. IEEE Sens. J. 2021, 21, 11488–11496.
3. Joo, H.; Simon, T.; Li, X.; Liu, H.; Tan, L.; Gui, L.; Banerjee, S.; Godisart, T.; Nabbe, B.; Matthews, I.; et al. Panoptic Studio: A Massively Multiview System for Social Interaction Capture. arXiv 2016, arXiv:1612.03153.
4. Simon, T.; Joo, H.; Matthews, I.; Sheikh, Y. Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1145–1153.
5. Zhang, Z.; Xie, S.; Chen, M.; Zhu, H. HandAugment: A simple data augmentation method for depth-based 3D hand pose estimation. arXiv 2020, arXiv:2001.00702.
6. Ge, L.; Cai, Y.; Weng, J.; Yuan, J. HandPointNet: 3D hand pose estimation using point sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8417–8426.
7. Yuan, S.; Garcia-Hernando, G.; Stenger, B.; Moon, G.; Chang, J.Y.; Lee, K.M.; Molchanov, P.; Kautz, J.; Honari, S.; Ge, L.; et al. Depth-based 3D hand pose estimation: From current achievements to future goals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2636–2645.
8. Cai, Y.; Ge, L.; Cai, J.; Yuan, J. Weakly-supervised 3D hand pose estimation from monocular RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 14–18 September 2018; pp. 666–682.
9. Panteleris, P.; Oikonomidis, I.; Argyros, A. Using a single RGB frame for real time 3D hand pose estimation in the wild. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 436–445.
10. Boukhayma, A.; Bem, R.d.; Torr, P.H. 3D hand shape and pose from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10843–10852.
11. Mueller, F.; Bernard, F.; Sotnychenko, O.; Mehta, D.; Sridhar, S.; Casas, D.; Theobalt, C. GANerated hands for real-time 3D hand tracking from monocular RGB. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 49–59.
12. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4724–4732.
13. Song, J.; Wang, L.; Van Gool, L.; Hilliges, O. Thin-slicing network: A deep structured model for pose estimation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4220–4229.
14. Tompson, J.J.; Jain, A.; LeCun, Y.; Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9.
15. Yang, W.; Ouyang, W.; Li, H.; Wang, X. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3073–3082.
16. Salman, S.A.; Zakir, A.; Takahashi, H. Cascaded deep graphical convolutional neural network for 2D hand pose estimation. In Proceedings of the International Workshop on Advanced Imaging Technology (IWAIT) 2023, SPIE, Jeju, Republic of Korea, 9–11 January 2023; Volume 12592, pp. 227–232.
17. Zhu, X.; Cheng, D.; Zhang, Z.; Lin, S.; Dai, J. An empirical study of spatial attention mechanisms in deep networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6688–6697.
18. Sun, J.; Zhang, Z.; Yang, L.; Zheng, J. Multi-view hand gesture recognition via pareto optimal front. IET Image Process. 2020, 14, 3579–3587.
19. Liu, Y.; Jiang, J.; Sun, J.; Wang, X. InterNet+: A Light Network for Hand Pose Estimation. Sensors 2021, 21, 6747.
20. Sun, X.; Wang, B.; Huang, L.; Zhang, Q.; Zhu, S.; Ma, Y. CrossFuNet: RGB and Depth Cross-Fusion Network for Hand Pose Estimation. Sensors 2021, 21, 6095.
21. Chen, X.; Wang, G.; Guo, H.; Zhang, C. Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing 2020, 395, 138–149.
22. Ge, L.; Liang, H.; Yuan, J.; Thalmann, D. Robust 3D hand pose estimation from single depth images using multi-view CNNs. IEEE Trans. Image Process. 2018, 27, 4422–4436.
23. Ding, L.; Wang, Y.; Laganière, R.; Huang, D.; Fu, S. A CNN model for real time hand pose estimation. J. Vis. Commun. Image Represent. 2021, 79, 103200.
24. Wang, Y.; Peng, C.; Liu, Y. Mask-pose cascaded CNN for 2D hand pose estimation from single color image. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 3258–3268.
25. Kanis, J.; Gruber, I.; Krňoul, Z.; Boháček, M.; Straka, J.; Hrúz, M. MuTr: Multi-Stage Transformer for Hand Pose Estimation from Full-Scene Depth Image. Sensors 2023, 23, 5509.
26. Zimmermann, C.; Brox, T. Learning to estimate 3D hand pose from single RGB images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4903–4911.
27. Guan, X.; Shen, H.; Nyatega, C.O.; Li, Q. Repeated Cross-Scale Structure-Induced Feature Fusion Network for 2D Hand Pose Estimation. Entropy 2023, 25, 724.
28. Tekin, B.; Rozantsev, A.; Lepetit, V.; Fua, P. Direct prediction of 3D body poses from motion compensated sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 991–1000.
29. Li, S.; Chan, A.B. 3D human pose estimation from monocular images with deep convolutional neural network. In Computer Vision–ACCV 2014, Proceedings of the 12th Asian Conference on Computer Vision, Singapore, 1–5 November 2014; Revised Selected Papers, Part II 12; Springer: Cham, Switzerland, 2015; pp. 332–347.
30. Wan, C.; Probst, T.; Van Gool, L.; Yao, A. Dense 3D regression for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5147–5156.
31. Awan, M.J.; Masood, O.A.; Mohammed, M.A.; Yasin, A.; Zain, A.M.; Damaševičius, R.; Abdulkareem, K.H. Image-based malware classification using VGG19 network and spatial convolutional attention. Electronics 2021, 10, 2444.
32. Misra, D. Mish: A self regularized non-monotonic activation function. arXiv 2019, arXiv:1908.08681.
33. Kong, D.; Chen, Y.; Ma, H.; Yan, X.; Xie, X. Adaptive graphical model network for 2D handpose estimation. arXiv 2019, arXiv:1909.08205.
34. Algan, G.; Ulusoy, I. Image classification with deep learning in the presence of noisy labels: A survey. Knowl.-Based Syst. 2021, 215, 106771.
| Dataset | Training | Validation | Testing |
|---|---|---|---|
| CMU Panoptic | 11,853 | 1482 | 1482 |
| Threshold | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CPM [12] | 22.88 | 58.10 | 73.48 | 80.45 | 84.27 | 86.88 | 88.91 | 90.42 | 91.61 | 92.61 | 76.96 |
| AGMN [33] | 23.90 | 60.26 | 76.21 | 83.70 | 87.70 | 90.27 | 91.97 | 93.23 | 94.30 | 95.20 | 79.67 |
| CDGCN [16] | 25.60 | 63.77 | 78.90 | 85.52 | 89.30 | 91.53 | 93.12 | 94.33 | 95.33 | 96.02 | 81.34 |
| SDFPoseGraphNet | 27.11 | 66.12 | 81.10 | 87.34 | 90.62 | 92.73 | 94.19 | 95.21 | 96.07 | 96.79 | 82.73 |
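PCK (Probability of Correct Keypoint) counts a predicted keypoint as correct when its distance to the ground truth, normalized by a per-image reference scale, falls below the threshold; the choice of scale here (the hand bounding-box size) is our assumption. A minimal sketch of the metric, which also recomputes the "Average" column of the table above from its ten per-threshold scores:

```python
import numpy as np

def pck(pred, gt, scale, thresholds):
    """Fraction of keypoints whose normalized distance to ground truth
    is within each threshold. pred, gt: (N, K, 2) keypoint coordinates;
    scale: (N,) per-image normalization (assumed: hand bounding-box size)."""
    dist = np.linalg.norm(pred - gt, axis=-1) / scale[:, None]  # (N, K)
    return [float((dist <= t).mean()) for t in thresholds]

# The "Average" column is the mean of the ten per-threshold PCK scores:
rows = {
    "CPM":             [22.88, 58.10, 73.48, 80.45, 84.27, 86.88, 88.91, 90.42, 91.61, 92.61],
    "AGMN":            [23.90, 60.26, 76.21, 83.70, 87.70, 90.27, 91.97, 93.23, 94.30, 95.20],
    "CDGCN":           [25.60, 63.77, 78.90, 85.52, 89.30, 91.53, 93.12, 94.33, 95.33, 96.02],
    "SDFPoseGraphNet": [27.11, 66.12, 81.10, 87.34, 90.62, 92.73, 94.19, 95.21, 96.07, 96.79],
}
averages = {name: round(sum(v) / len(v), 2) for name, v in rows.items()}
# averages == {"CPM": 76.96, "AGMN": 79.67, "CDGCN": 81.34, "SDFPoseGraphNet": 82.73}
```

The recomputed means match the table's "Average" column, confirming it is a plain arithmetic mean over the ten thresholds.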
| Threshold | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| FIM | 24.28 | 61.21 | 76.63 | 83.55 | 87.36 | 89.90 | 91.64 | 93.03 | 94.08 | 94.97 |
| SIM | 24.85 | 62.25 | 78.18 | 85.47 | 89.33 | 91.60 | 93.17 | 94.41 | 95.50 | 96.26 |
| SDFPoseGraphNet | 27.11 | 66.12 | 81.10 | 87.34 | 90.62 | 92.73 | 94.19 | 95.21 | 96.07 | 96.79 |
| Threshold | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| FIM | 22.88 | 58.10 | 73.48 | 80.45 | 84.27 | 86.88 | 88.91 | 90.42 | 91.61 | 92.61 |
| FIM with SA | 24.28 | 61.21 | 76.63 | 83.55 | 87.36 | 89.90 | 91.64 | 93.03 | 94.08 | 94.97 |
| Threshold | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| FIM | 24.53 | 60.82 | 75.84 | 82.72 | 86.47 | 89.07 | 90.98 | 92.42 | 93.52 | 94.44 |
| SIM | 23.85 | 60.11 | 76.21 | 83.68 | 87.87 | 90.52 | 92.44 | 93.84 | 94.85 | 95.63 |
| SDFPoseGraphNet | 26.25 | 64.22 | 79.44 | 85.93 | 89.41 | 91.74 | 93.30 | 94.51 | 95.42 | 96.22 |
| Threshold | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| FIM | 25.43 | 61.30 | 75.30 | 81.48 | 85.02 | 87.56 | 89.45 | 90.80 | 91.96 | 92.97 |
| SIM | 23.97 | 60.26 | 76.03 | 83.29 | 87.21 | 89.86 | 91.84 | 93.14 | 94.21 | 94.99 |
| SDFPoseGraphNet | 26.75 | 63.57 | 77.01 | 83.13 | 86.91 | 89.52 | 91.28 | 92.72 | 93.96 | 94.79 |
| Threshold | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| FIM | 25.90 | 62.87 | 76.64 | 82.77 | 86.33 | 88.54 | 90.37 | 91.63 | 92.79 | 93.74 |
| SIM | 24.38 | 61.67 | 77.71 | 84.69 | 88.59 | 90.98 | 92.69 | 93.92 | 94.87 | 95.71 |
| SDFPoseGraphNet | 26.25 | 64.12 | 79.01 | 85.89 | 89.89 | 91.88 | 93.14 | 94.67 | 95.44 | 96.33 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Salman, S.A.; Zakir, A.; Takahashi, H. SDFPoseGraphNet: Spatial Deep Feature Pose Graph Network for 2D Hand Pose Estimation. Sensors 2023, 23, 9088. https://doi.org/10.3390/s23229088