Abstract
We aim to improve the performance of regressing hand keypoints and segmenting pixel-level hand masks under new imaging conditions (e.g., outdoors) when we only have labeled images taken under very different conditions (e.g., indoors). In the real world, it is important that the model trained for both tasks works under various imaging conditions. However, their variation covered by existing labeled hand datasets is limited. Thus, it is necessary to adapt the model trained on the labeled images (source) to unlabeled images (target) with unseen imaging conditions. While self-training domain adaptation methods (i.e., learning from the unlabeled target images in a self-supervised manner) have been developed for both tasks, their training may degrade performance when the predictions on the target images are noisy. To avoid this, it is crucial to assign a low importance (confidence) weight to the noisy predictions during self-training. In this paper, we propose to utilize the divergence of two predictions to estimate the confidence of the target image for both tasks. These predictions are given from two separate networks, and their divergence helps identify the noisy predictions. To integrate our proposed confidence estimation into self-training, we propose a teacher-student framework where the two networks (teachers) provide supervision to a network (student) for self-training, and the teachers are learned from the student by knowledge distillation. Our experiments show its superiority over state-of-the-art methods in adaptation settings with different lighting, grasping objects, backgrounds, and camera viewpoints. Our method improves by \(4\%\) the multi-task score on HO3D compared to the latest adversarial adaptation method. We also validate our method on Ego4D, egocentric videos with rapid changes in imaging conditions outdoors.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Andriluka, M., Pishchulin, L., Gehler, P.V., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3686–3693 (2014)
Arazo, E., Ortego, D., Albert, P., O’Connor, N.E., McGuinness, K.: Pseudo-labeling and confirmation bias in deep semi-supervised learning. In: IEEE International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2020)
Benitez-Garcia, G., et al.: Improving real-time hand gesture recognition with semantic segmentation. Sensors 21(2), 356 (2021)
Blum, A., Mitchell, T.M.: Combining labeled and unlabeled data with co-training. In: Proceedings of the ACM Annual Conference on Computational Learning Theory (COLT), pp. 92–100 (1998)
Boukhayma, A., Bem, R.D., Torr, P.H.S.: 3D hand shape and pose from images in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10843–10852 (2019)
Brahmbhatt, S., Tang, C., Twigg, C.D., Kemp, C.C., Hays, J.: ContactPose: a dataset of grasps with object contact and hand pose. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12358, pp. 361–378. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58601-0_22
Cai, M., Lu, F., Sato, Y.: Generalizing hand segmentation in egocentric videos with uncertainty-guided model adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14380–14389 (2020)
Cai, M., Luo, M., Zhong, X., Chen, H.: Uncertainty-aware model adaptation for unsupervised cross-domain object detection. CoRR, abs/2108.12612 (2021)
Cai, Q., Pan, Y., Ngo, C.-W., Tian, X., Duan, L., Yao, T.: Exploring object relation in mean teacher for cross-domain detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11457–11466 (2019)
Çalli, B., Walsman, A., Singh, A., Srinivasa, S.S., Abbeel, P., Dollar, A.M.: Benchmarking in manipulation research: using the Yale-CMU-Berkeley object and model set. IEEE Robot. Autom. Mag. 22(3), 36–52 (2015)
Cao, J., Tang, H., Fang, H., Shen, X., Tai, Y.-W., Lu, C.: Cross-domain adaptation for animal pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9497–9506 (2019)
Cao, Z., Radosavovic, I., Kanazawa, A., Malik, J.: Reconstructing hand-object interactions in the wild. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 12417–12426 (2021)
Chao, Y.-W., et al.: DexYCB: a benchmark for capturing hand grasping of objects. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9044–9053 (2021)
Chen, C.-H., et al.: Unsupervised 3D pose estimation with geometric self-supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5714–5724 (2019)
Chen, M., Weinberger, K.Q., Blitzer, J.: Co-training for domain adaptation. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 2456–2464 (2011)
Chen, X., Wang, G., Zhang, C., Kim, T.-K., Ji, X.: SHPR-Net: deep semantic hand pose regression from point clouds. IEEE Access 6, 43425–43439 (2018)
Damen, D., et al.: Rescaling egocentric vision. Int. J. Comput. Vision (IJCV) (2021)
Deng, J., Li, W., Chen, Y., Duan, L.: Unbiased mean teacher for cross-domain object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4091–4101 (2021)
French, G., Mackiewicz, M., Fisher, M.H.: Self-ensembling for visual domain adaptation. In: Proceedings of the International Conference on Learning Representations (ICLR) (2018)
Fu, H., Gong, M., Wang, C., Batmanghelich, K., Zhang, K., Tao, D.: Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2427–2436 (2019)
Fu, Q., Liu, X., Kitani, K.M.: Sequential decision-making for active object detection from hand. CoRR, abs/2110.11524 (2021)
Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1050–1059 (2016)
Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1180–1189 (2015)
Garcia-Hernando, G., Yuan, S., Baek, S., Kim, T.-K.: First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 409–419 (2018)
Ge, Y., Chen, D., Li, H.: Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. In Proceedings of the International Conference on Learning Representations (ICLR) (2020)
Glauser, O., Wu, S., Panozzo, D., Hilliges, O., Sorkine-Hornung, O.: Interactive hand pose estimation using a stretch-sensing soft glove. ACM Trans. Graph. 38(4), 41:1-41:15 (2019)
Goudie, D., Galata, A.: 3D hand-object pose estimation from depth with convolutional neural networks. In: Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp. 406–413 (2017)
Grauman, K., et al.: Ego4D: around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18995–19012 (2022)
Hampali, S., Rad, M., Oberweger, M., Lepetit, V.: Honnotate: a method for 3D annotation of hand and object poses. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3196–3206 (2020)
Hasson, Y., et al.: Learning joint reconstruction of hands and manipulated objects. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11807–11816 (2019)
Hidalgo, G., et al.: OpenPose. https://github.com/CMU-Perceptual-Computing-Lab/openpose
Huang, W., Ren, P., Wang, J., Qi, Q., Sun, H.: AWR: adaptive weighting regression for 3D hand pose estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 11061–11068 (2020)
Jiang, J., Ji, Y., Wang, X., Liu, Y., Wang, J., Long, M.: Regressive domain adaptation for unsupervised keypoint detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6780–6789 (2021)
Joo, H., et al.: Panoptic studio: a massively multiview system for social motion capture. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3334–3342 (2015)
Kim, S., Chi, H.-G., Hu, X., Vegesana, A., Ramani, K.: First-person view hand segmentation of multi-modal hand activity video dataset. In: Proceedings of the British Machine Vision Conference (BMVC) (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (ICLR) (2014)
Lee, K., Shrivastava, A., Kacorri, H.: Hand-priming in object localization for assistive egocentric vision. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 3422–3432 (2020)
Li, Y.-J., et al.: Cross-domain object detection via adaptive self-training. CoRR, abs/2111.13216 (2021)
Liang, H., Yuan, J., Thalmann, D., Magnenat-Thalmann, N.: AR in hand: egocentric palm pose tracking and gesture recognition for augmented reality applications. In: Proceedings of the ACM International Conference on Multimedia (MM), pp. 743–744 (2015)
Likitlersuang, J., Sumitro, E.R., Cao, T., Visée, R.J., Kalsi-Ryan, S., Zariffa, J.: Egocentric video: a new tool for capturing hand use of individuals with spinal cord injury at home. J. Neuroeng. Rehabil. (JNER) 16(1), 83 (2019)
Liu, Y.-C., et al.: Unbiased teacher for semi-supervised object detection. In: Proceedings of the International Conference on Learning Representations (ICLR) (2021)
Long, M., Zhu, H., Wang, J., Jordan, M.I.: Unsupervised domain adaptation with residual transfer networks. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 136–144 (2016)
Lu, Y., Mayol-Cuevas, W.W.: Understanding egocentric hand-object interactions from hand pose estimation. CoRR, abs/2109.14657 (2021)
McKee, R., McKee, D., Alexander, D., Paillat, E.: NZ sign language exercises. Deaf Studies Department of Victoria University of Wellington. http://www.victoria.ac.nz/llc/llc_resources/nzsl
Melas-Kyriazi, L., Manrai, A.K.: Pixmatch: unsupervised domain adaptation via pixelwise consistency training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12435–12445 (2021)
Moon, G., Yu, S.-I., Wen, H., Shiratori, T., Lee, K.M.: InterHand2.6M: a dataset and baseline for 3D interacting hand pose estimation from a single RGB image. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 548–564. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_33
Mueller, F., et al.: GANerated hands for real-time 3D hand tracking from monocular RGB. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 49–59 (2018)
Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, D., Theobalt, C.: Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1163–1172 (2017)
Neverova, N., Wolf, C., Nebout, F., Taylor, G.W.: Hand pose estimation through semi-supervised and weakly-supervised learning. Comput. Vis. Image Underst. 164, 56–67 (2017)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Ohkawa, T., Furuta, R., Sato, Y.: Efficient annotation and learning for 3D hand pose estimation: a survey. CoRR, abs/2206.02257 (2022)
Ohkawa, T., Inoue, N., Kataoka, H., Inoue, N.: Augmented cyclic consistency regularization for unpaired image-to-image translation. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 362–369 (2020)
Ohkawa, T., Yagi, T., Hashimoto, A., Ushiku, Y., Sato, Y.: Foreground-aware stylization and consensus pseudo-labeling for domain adaptation of first-person hand segmentation. IEEE Access 9, 94644–94655 (2021)
Pham, H., Dai, Z., Xie, Q., Le, Q.V.: Meta pseudo labels. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11557–11568 (2021)
Prabhu, V., Khare, S., Kartik, D., Hoffman, J.: SENTRY: selective entropy optimization via committee consistency for unsupervised domain adaptation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8558–8567 (2021)
Qian, C., Sun, X., Wei, Y., Tang, X., Sun, J.: Realtime and robust hand tracking from depth. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1106–1113 (2014)
Qiao, S., Shen, W., Zhang, Z., Wang, B., Yuille, A.: Deep co-training for semi-supervised image recognition. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 142–159. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_9
Ren, P., Sun, H., Qi, Q., Wang, J., Huang, W.: SRN: stacked regression network for real-time 3D hand pose estimation. In: Proceedings of the British Machine Vision Conference (BMVC) (2019)
Saito, K., Ushiku, Y., Harada, T.: Asymmetric tri-training for unsupervised domain adaptation. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 2988–2997 (2017)
Santavas, N., Kansizoglou, I., Bampis, L., Karakasis, E., Gasteratos, A.: Attention! A lightweight 2D hand pose estimation approach. CoRR, abs/2001.08047 (2020)
Simon, T., Joo, H., Matthews, I., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4645–4653 (2017)
Sridhar, S., Mueller, F., Zollhoefer, M., Casas, D., Oulasvirta, A., Theobalt, C.: Real-time joint tracking of a hand manipulating an object from RGB-D input. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 294–310 (2016)
Taheri, O., Ghorbani, N., Black, M.J., Tzionas, D.: GRAB: a dataset of whole-body human grasping of objects. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 581–600. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_34
Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: Proceedings of the International Conference on Learning Representations (ICLR) (2017)
Urooj, A., Borji, A.: Analysis of hand segmentation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4710–4719 (2018)
Vasconcelos, L.O., Mancini, M., Boscaini, D., Bulò, S.R., Caputo, B., Ricci, E.: Shape consistent 2D keypoint estimation under domain shift. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 8037–8044 (2020)
Vu, T.H., Jain, H., Bucher, M., Cord, M., Perez, P.: Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2512–2521 (2019)
Wang, Y., Peng, C., Liu, Y.: Mask-pose cascaded CNN for 2D hand pose estimation from single color image. IEEE Trans. Circuits Syst. Video Technol. (TCSVT) 29(11), 3258–3268 (2019)
Wei, S.-E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4732 (2016)
Wu, M.-Y., Ting, P.-W., Tang, Y.-H., Chou, E.T., Fu, L.-C.: Hand pose estimation in object-interaction based on deep learning for virtual reality applications. J. Vis. Commun. Image Represent. 70, 102802 (2020)
Xie, Q., Dai, Z., Hovy, E., Luong, T., Le, Q.: Unsupervised data augmentation for consistency training. In: Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) (2020)
Yan, L., Fan, B., Xiang, S., Pan, C.: CMT: cross mean teacher unsupervised domain adaptation for VHR image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2022)
Yang, L., Chen, S., Yao, A.: Semihand: semi-supervised hand pose estimation with consistency. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 11364–11373 (2021)
Yang, L., Li, J., Xu, W., Diao, Y., Lu, C.: Bihand: recovering hand mesh with multi-stage bisected hourglass networks. In: Proceedings of the British Machine Vision Conference (BMVC) (2020)
Yuan, S., Ye, Q., Stenger, B., Jain, S., Kim, T.K.: BigHand2.2M benchmark: hand pose dataset and state of the art analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2605–2613 (2017)
Zhang, C., Wang, G., Chen, X., Xie, P., Yamasaki, T.: Weakly supervised segmentation guided hand pose estimation during interaction with unknown objects. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, (ICASSP), pp. 2673–2677 (2020)
Zhou, X., Karpur, A., Gan, C., Luo, L., Huang, Q.: Unsupervised domain adaptation for 3D keypoint estimation via view consistency. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11216, pp. 141–157. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01258-8_9
Zimmermann, C., Argus, M., Brox, T.: Contrastive representation learning for hand shape estimation. CoRR, abs/2106.04324 (2021)
Zimmermann, C., Brox, T.: Learning to estimate 3D hand pose from single RGB images. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4913–4921 (2017)
Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: FreiHAND: a dataset for markerless capture of hand pose and shape from single RGB images. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 813–822 (2019)
Zou, Y., Yu, Z., Kumar, B.V., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305 (2018)
Acknowledgments
This work was supported by JST ACT-X Grant Number JPMJAX2007, JSPS Research Fellowships for Young Scientists, JST AIP Acceleration Research Grant Number JPMJCR20U1, and JSPS KAKENHI Grant Number JP20H04205, Japan. This work was also supported in part by a hardware donation from Yu Darvish.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ohkawa, T., Li, YJ., Fu, Q., Furuta, R., Kitani, K.M., Sato, Y. (2022). Domain Adaptive Hand Keypoint and Pixel Localization in the Wild. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13669. Springer, Cham. https://doi.org/10.1007/978-3-031-20077-9_5
Download citation
DOI: https://doi.org/10.1007/978-3-031-20077-9_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20076-2
Online ISBN: 978-3-031-20077-9
eBook Packages: Computer ScienceComputer Science (R0)