Abstract
Estimating the 3D pose of interacting hands from a single RGB image is a challenging problem: in two-handed interactions, the hands tend to occlude each other and are highly self-similar. This study proposes HybridPoseNet, a simple, accurate, end-to-end framework for estimating 3D interacting hand pose. The network employs an encoder-decoder architecture. Specifically, the feature encoder is a hybrid structure that combines a convolutional neural network (CNN) with a transformer to encode hand information: the CNN extracts local detailed features from the input image, while a vision transformer captures long-range spatial interactions between feature vectors at different positions. The 3D pose decoder consists of left-hand and right-hand branches fused via a feature enhancement module (FEM), which helps reduce the appearance ambiguity caused by the self-similarity of the two hands. The decoder lifts the 2D pose to 3D by estimating two depth components. Ablation experiments demonstrate the effectiveness of each module in the network, and comprehensive experiments on the InterHand2.6M dataset show that the proposed method outperforms previous state-of-the-art methods for interacting hand pose estimation.
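The paper's implementation is not included in this excerpt. As a rough, hypothetical sketch (not the authors' code), the core idea of the hybrid encoder can be illustrated with numpy: a CNN feature map is flattened into a set of per-position tokens, and a single self-attention layer lets every token aggregate information from all other positions, which is what allows the transformer part to model long-range spatial interactions that a purely local convolution misses. All weight matrices here are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over N position tokens of dimension d.

    tokens: (N, d) feature vectors, e.g. a flattened CNN feature map.
    Returns (N, d): each output token is a weighted sum over ALL input
    positions, so cross-positional (long-range) interactions are captured.
    """
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N, N), rows sum to 1
    return attn @ v

rng = np.random.default_rng(0)
d = 8
tokens = rng.standard_normal((16, d))  # e.g. a 4x4 feature map, flattened
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (16, 8): same token layout, globally mixed features
```

In the full architecture described by the abstract, blocks like this would sit on top of the CNN backbone, and the attended tokens would then feed the two-branch (left/right) decoder.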
Data availability
The datasets used in this study were obtained from a publicly accessible website (https://mks0601.github.io/InterHand2.6M/).
Funding
This work was supported by the Key Research and Technology Development Projects of Anhui Province under Grant No. 2022k07020006, the Major Natural Science Research Projects in Colleges and Universities of Anhui Province under Grant Nos. KJ2021ZD0004 and 2022AH051160, and the National Key Research and Development Program of China under Grant No. 2020YFF0303803.
Author information
Contributions
WB formulated the overarching research goals and the model structure. QG created the models and designed the methodology. XY revised the article and provided material support. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Bao, W., Gao, Q. & Yang, X. A hybrid network for estimating 3D interacting hand pose from a single RGB image. SIViP 18, 3801–3814 (2024). https://doi.org/10.1007/s11760-024-03043-1