Diverse Humanoid Robot Pose Estimation from Images Using Only Sparse Datasets
Figure 1. We present a learning-based full-body pose estimation method for various humanoid robots. Our keypoint detector, trained on an extended pose dataset, consistently estimates humanoid robot poses over time from videos, capturing front, side, back, and partial views, while excluding human bodies (see supplemental video [11]). The following robots are shown: (a) Optimus Gen 2 (Tesla) [12]; (b) Apollo (Apptronik) [13]; (c) Atlas (Boston Dynamics) [14]; (d) DARwIn-OP (Robotis) [15]; (e) EVE (1X Technologies) [16]; (f) FIGURE 01 (Figure) [17]; (g) H1 (Unitree) [18]; (h) Kepler (Kepler Exploration Robot) [19]; (i) Phoenix (Sanctuary AI) [20]; (j) TALOS (PAL Robotics) [21]; (k) Toro (DLR) [22].
Figure 2. Joint configurations of the humanoid robots in the Diverse Humanoid Robot Pose Dataset (DHRP). (a) Apollo; (b) Atlas; (c) DARwIn-OP; (d) EVE; (e) FIGURE 01; (f) H1; (g) Kepler; (h) Optimus Gen 2; (i) Phoenix; (j) TALOS; (k) Toro. Note: Phoenix lacks lower-body data in the dataset.
Figure 3. Example training images from the DHRP dataset. (a) Apollo; (b) Atlas; (c) DARwIn-OP; (d) EVE; (e) FIGURE 01; (f) H1; (g) Kepler; (h) Optimus Gen 2; (i) Phoenix; (j) TALOS; (k) Toro. Note: Phoenix lacks lower-body data in the dataset.
Figure 4. Example images from the arbitrary humanoid robot dataset. These 2k additional images increase the diversity of body shapes, appearances, and motions across humanoid robots.
Figure 5. Example images from the synthetic dataset. The first row shows examples generated through AI-assisted image synthesis using Viggle [54]; the second row shows examples created via 3D character simulations in Unreal Engine [55]. These 6.7k additional annotations increase the diversity of motions and scenarios across humanoid robots.
Figure 6. Example images from the random background dataset. The first row shows examples from the target humanoid robot dataset; the second row shows the corresponding foreground-removed images in the background dataset, generated with Adobe Photoshop Generative Fill [58]. This dataset includes 133 AI-assisted foreground-removed images and 1886 random indoor and outdoor background images. The 2k background images, which contain no humanoid robots, improve the separation of robots from their backgrounds, particularly in environments with metallic objects that resemble the robots' body surfaces.
Figure 7. Network architecture of the 2D joint detector. From a single input image, the n-stage network produces keypoint coordinates K and their corresponding confidence heat maps H. At each stage, the output of the hourglass module [59] is passed both to the next stage and to the Differentiable Spatial-to-Numerical Transform (DSNT) regression module [62], which produces H and K. In the parsing stage, each keypoint k_j = (x_j, y_j) ∈ K is identified together with its associated confidence value c_j = H_j(x_j, y_j). (A code sketch of this parsing step follows the figure captions.)
Figure 8. Qualitative evaluation on selected frames. The proposed learning-based full-body pose estimation method, trained on our DHRP dataset, consistently estimates humanoid robot poses over time from video frames, capturing front, side, back, and partial views. (a) Apollo; (b) Atlas; (c) DARwIn-OP; (d) EVE; (e) FIGURE 01; (f) H1; (g) Kepler; (h) Phoenix; (i) TALOS; (j) Toro.
Figure 9. Qualitative evaluation on selected frames: full-body pose estimation results for miniature humanoid robot models not included in the DHRP dataset. These results demonstrate that our method extends to other types of humanoid robots.
Figure 10. Common failure cases. (a–c) False part detections caused by interference from nearby metallic objects. (d) False negatives due to interference from objects with similar appearances. (e) False negatives in egocentric views caused by the rarity of torso observations. (f) False positives on non-humanoid robots.
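The parsing stage described in the Figure 7 caption reduces each DSNT heat map to a single keypoint coordinate and an associated confidence. The following PyTorch-style sketch illustrates one way this reduction can be implemented; the heat-map normalization, tensor layout, and function names are illustrative assumptions rather than the released implementation.

```python
import torch

def parse_keypoints(heatmaps: torch.Tensor):
    """Reduce per-joint heat maps H to keypoints K and confidences c.

    heatmaps: (J, H, W) tensor of non-negative per-joint confidence maps.
    Returns: (J, 2) pixel coordinates and (J,) confidences, following the
    DSNT-style expectation over a normalized heat map (Nibali et al.).
    """
    J, Hh, Ww = heatmaps.shape

    # Normalize each heat map so it sums to 1 (treat it as a 2D distribution).
    probs = heatmaps.clamp(min=0).flatten(1)
    probs = probs / probs.sum(dim=1, keepdim=True).clamp(min=1e-8)
    probs = probs.view(J, Hh, Ww)

    # Coordinate grids in pixel units.
    ys = torch.arange(Hh, dtype=heatmaps.dtype).view(1, Hh, 1)
    xs = torch.arange(Ww, dtype=heatmaps.dtype).view(1, 1, Ww)

    # DSNT: keypoint k_j = expected (x, y) under the normalized heat map.
    x = (probs * xs).sum(dim=(1, 2))
    y = (probs * ys).sum(dim=(1, 2))
    keypoints = torch.stack([x, y], dim=1)          # (J, 2)

    # Confidence c_j = H_j(x_j, y_j): read the raw heat map at the keypoint.
    xi = x.round().long().clamp(0, Ww - 1)
    yi = y.round().long().clamp(0, Hh - 1)
    conf = heatmaps[torch.arange(J), yi, xi]        # (J,)
    return keypoints, conf
```

Keypoints whose confidence falls below a chosen threshold can then be discarded, which is one way to handle partially visible robots such as upper-body-only views.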
Abstract
1. Introduction
- The first pose dataset providing full-body annotations for over ten different types of humanoid robots.
- A benchmark method for real-time, learning-based full-body pose estimation from a single image.
- An evaluation dataset and standard metrics for assessing pose estimation performance.
2. Related Work
2.1. Learning-Based Human Pose Estimation
2.2. Marker-Based Human–Robot Interaction
2.3. Robot Arm Pose Estimation
2.4. Full-Body Robot Pose Estimation
3. Diverse Humanoid Robot Pose Dataset (DHRP)
3.1. Target Humanoid Robots
3.2. Sparse Dataset and Annotation Process
3.3. Dataset Extension
3.4. Arbitrary Humanoid Robot Dataset
3.5. Synthetic Dataset
3.5.1. AI-Assisted Image Synthesis
3.5.2. 3D Character Simulations
3.6. Background Dataset
3.6.1. AI-Assisted Foreground Removal
3.6.2. Arbitrary Random Backgrounds
4. Diverse Humanoid Robot Pose Estimation
4.1. The Benchmark Method
4.1.1. Joint Detection Network
4.1.2. Keypoint Parsing
4.1.3. Network Optimization
4.1.4. Training Details
4.2. The Evaluation Method
5. Results and Evaluation
5.1. Evaluation on Dataset Configurations
5.2. Evaluation on Sparse Datasets
5.3. Evaluation on Network Architecture
5.4. Comparison with Other Methods
5.5. Quantitative Evaluation on Target Humanoid Robots
5.6. Qualitative Evaluation on Target Humanoid Robots
5.7. Failure Case Analysis
6. Limitations and Future Work
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Saeedvand, S.; Jafari, M.; Aghdasi, H.S.; Baltes, J. A comprehensive survey on humanoid robot development. Knowl. Eng. Rev. 2019, 34, e20. [Google Scholar] [CrossRef]
- Tong, Y.; Liu, H.; Zhang, Z. Advancements in humanoid robots: A comprehensive review and future prospects. IEEE/CAA J. Autom. Sin. 2024, 11, 301–328. [Google Scholar] [CrossRef]
- Darvish, K.; Penco, L.; Ramos, J.; Cisneros, R.; Pratt, J.; Yoshida, E.; Ivaldi, S.; Pucci, D. Teleoperation of humanoid robots: A survey. IEEE Trans. Robot. 2023, 39, 1706–1727. [Google Scholar] [CrossRef]
- Suzuki, R.; Karim, A.; Xia, T.; Hedayati, H.; Marquardt, N. Augmented reality and robotics: A survey and taxonomy for ar-enhanced human-robot interaction and robotic interfaces. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April–5 May 2022; pp. 1–33. [Google Scholar]
- Miseikis, J.; Knobelreiter, P.; Brijacak, I.; Yahyanejad, S.; Glette, K.; Elle, O.J.; Torresen, J. Robot localisation and 3D position estimation using a free-moving camera and cascaded convolutional neural networks. In Proceedings of the 2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Auckland, New Zealand, 9–12 July 2018; pp. 181–187. [Google Scholar]
- Lee, T.E.; Tremblay, J.; To, T.; Cheng, J.; Mosier, T.; Kroemer, O.; Fox, D.; Birchfield, S. Camera-to-robot pose estimation from a single image. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 13 May–31 August 2020; pp. 9426–9432. [Google Scholar]
- Lu, J.; Richter, F.; Yip, M.C. Pose estimation for robot manipulators via keypoint optimization and sim-to-real transfer. IEEE Robot. Autom. Lett. 2022, 7, 4622–4629. [Google Scholar] [CrossRef]
- Tejwani, R.; Ma, C.; Bonato, P.; Asada, H.H. An Avatar Robot Overlaid with the 3D Human Model of a Remote Operator. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 7061–7068. [Google Scholar]
- Amini, A.; Farazi, H.; Behnke, S. Real-time pose estimation from images for multiple humanoid robots. In Robot World Cup; Springer: Cham, Switzerland, 2021; pp. 91–102. [Google Scholar]
- Cho, Y.; Son, W.; Bak, J.; Lee, Y.; Lim, H.; Cha, Y. Full-Body Pose Estimation of Humanoid Robots Using Head-Worn Cameras for Digital Human-Augmented Robotic Telepresence. Mathematics 2024, 12, 3039. [Google Scholar] [CrossRef]
- Supplementary Video. Available online: https://xrlabku.webflow.io/papers/diverse-humanoid-robot-pose-estimation-using-only-sparse-datasets (accessed on 2 October 2024).
- Tesla. Optimus Gen2. 2023. Available online: https://www.youtube.com/@tesla (accessed on 20 August 2024).
- Apptronik. Apollo. 2023. Available online: https://apptronik.com/apollo/ (accessed on 20 August 2024).
- Boston Dynamics. Atlas. 2016. Available online: https://bostondynamics.com/atlas/ (accessed on 20 August 2024).
- Robotis. DARwIn-OP. 2011. Available online: https://emanual.robotis.com/docs/en/platform/op/getting_started/ (accessed on 20 August 2024).
- 1X Technologies. EVE. 2023. Available online: https://www.1x.tech/androids/eve (accessed on 20 August 2024).
- Figure. FIGURE01. 2023. Available online: https://www.figure.ai/ (accessed on 20 August 2024).
- Unitree. H1. 2023. Available online: https://www.unitree.com/h1/ (accessed on 20 August 2024).
- Kepler Exploration Robot. Kepler. 2023. Available online: https://www.gotokepler.com/home (accessed on 20 August 2024).
- Sanctuary AI. Phoenix. 2023. Available online: https://sanctuary.ai/product/ (accessed on 20 August 2024).
- PAL Robotics. TALOS. 2017. Available online: https://pal-robotics.com/robot/talos/ (accessed on 20 August 2024).
- DLR. Toro. 2013. Available online: https://www.dlr.de/en/rm/research/robotic-systems/humanoids/toro (accessed on 20 August 2024).
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Artacho, B.; Savakis, A. Omnipose: A multi-scale framework for multi-person pose estimation. arXiv 2021, arXiv:2103.10180. [Google Scholar]
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. Vitpose: Simple vision transformer baselines for human pose estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584. [Google Scholar]
- Huang, Y.; Kaufmann, M.; Aksan, E.; Black, M.J.; Hilliges, O.; Pons-Moll, G. Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time. ACM Trans. Graph. 2018, 37, 1–15. [Google Scholar] [CrossRef]
- Von Marcard, T.; Henschel, R.; Black, M.J.; Rosenhahn, B.; Pons-Moll, G. Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 601–617. [Google Scholar]
- Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.A.; Tzionas, D.; Black, M.J. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–19 June 2019. [Google Scholar]
- Kocabas, M.; Athanasiou, N.; Black, M.J. VIBE: Video Inference for Human Body Pose and Shape Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Guzov, V.; Mir, A.; Sattler, T.; Pons-Moll, G. Human poseitioning system (hps): 3d human pose estimation and self-localization in large scenes from body-mounted sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4318–4329. [Google Scholar]
- Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3D Human Pose Estimation With Spatial and Temporal Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11656–11665. [Google Scholar]
- Gong, J.; Foo, L.G.; Fan, Z.; Ke, Q.; Rahmani, H.; Liu, J. DiffPose: Toward More Reliable 3D Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13041–13051. [Google Scholar]
- Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. 3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4790–4799. [Google Scholar]
- Shan, W.; Liu, Z.; Zhang, X.; Wang, Z.; Han, K.; Wang, S.; Ma, S.; Gao, W. Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14761–14771. [Google Scholar]
- Einfalt, M.; Ludwig, K.; Lienhart, R. Uplift and Upsample: Efficient 3D Human Pose Estimation With Uplifting Transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2903–2913. [Google Scholar]
- Jiang, Z.; Zhou, Z.; Li, L.; Chai, W.; Yang, C.Y.; Hwang, J.N. Back to Optimization: Diffusion-Based Zero-Shot 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6142–6152. [Google Scholar]
- Bambuŝek, D.; Materna, Z.; Kapinus, M.; Beran, V.; Smrž, P. Combining interactive spatial augmented reality with head-mounted display for end-user collaborative robot programming. In Proceedings of the 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), New Delhi, India, 14–18 October 2019; pp. 1–8. [Google Scholar]
- Qian, L.; Deguet, A.; Wang, Z.; Liu, Y.H.; Kazanzides, P. Augmented reality assisted instrument insertion and tool manipulation for the first assistant in robotic surgery. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5173–5179. [Google Scholar]
- Tran, N. Exploring Mixed Reality Robot Communication under Different Types of Mental Workload; Colorado School of Mines: Golden, CO, USA, 2020. [Google Scholar]
- Frank, J.A.; Moorhead, M.; Kapila, V. Mobile mixed-reality interfaces that enhance human–robot interaction in shared spaces. Front. Robot. AI 2017, 4, 20. [Google Scholar] [CrossRef]
- Ban, S.; Fan, J.; Zhu, W.; Ma, X.; Qiao, Y.; Wang, Y. Real-time Holistic Robot Pose Estimation with Unknown States. arXiv 2024, arXiv:2402.05655. [Google Scholar]
- Tian, Y.; Zhang, J.; Yin, Z.; Dong, H. Robot structure prior guided temporal attention for camera-to-robot pose estimation from image sequence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8917–8926. [Google Scholar]
- Rodrigues, I.R.; Dantas, M.; de Oliveira Filho, A.T.; Barbosa, G.; Bezerra, D.; Souza, R.; Marquezini, M.V.; Endo, P.T.; Kelner, J.; Sadok, D. A framework for robotic arm pose estimation and movement prediction based on deep and extreme learning models. J. Supercomput. 2023, 79, 7176–7205. [Google Scholar] [CrossRef]
- Olson, E. AprilTag: A robust and flexible visual fiducial system. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 3400–3407. [Google Scholar] [CrossRef]
- Kalaitzakis, M.; Cain, B.; Carroll, S.; Ambrosi, A.; Whitehead, C.; Vitzilaios, N. Fiducial markers for pose estimation: Overview, applications and experimental comparison of the artag, apriltag, aruco and stag markers. J. Intell. Robot. Syst. 2021, 101, 1–26. [Google Scholar] [CrossRef]
- Ilonen, J.; Kyrki, V. Robust robot-camera calibration. In Proceedings of the 2011 15th International Conference on Advanced Robotics (ICAR), Tallinn, Estonia, 20–23 June 2011; pp. 67–74. [Google Scholar] [CrossRef]
- Davis, L.; Clarkson, E.; Rolland, J.P. Predicting accuracy in pose estimation for marker-based tracking. In Proceedings of the Second IEEE and ACM International Symposium on Mixed and Augmented Reality, Tokyo, Japan, 10 October 2003; pp. 28–35. [Google Scholar]
- Ebmer, G.; Loch, A.; Vu, M.N.; Mecca, R.; Haessig, G.; Hartl-Nesic, C.; Vincze, M.; Kugi, A. Real-Time 6-DoF Pose Estimation by an Event-Based Camera Using Active LED Markers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 8137–8146. [Google Scholar]
- Ishida, M.; Shimonomura, K. Marker based camera pose estimation for underwater robots. In Proceedings of the 2012 IEEE/SICE International Symposium on System Integration (SII), Fukuoka, Japan, 16–18 December 2012; pp. 629–634. [Google Scholar]
- Garrido-Jurado, S.; Muñoz-Salinas, R.; Madrid-Cuevas, F.; Marín-Jiménez, M. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognit. 2014, 47, 2280–2292. [Google Scholar] [CrossRef]
- Romero-Ramirez, F.J.; Munoz-Salinas, R.; Medina-Carnicer, R. Fractal markers: A new approach for long-range marker pose estimation under occlusion. IEEE Access 2019, 7, 169908–169919. [Google Scholar] [CrossRef]
- Di Giambattista, V.; Fawakherji, M.; Suriani, V.; Bloisi, D.D.; Nardi, D. On Field Gesture-Based Robot-to-Robot Communication with NAO Soccer Players. In Proceedings of the RoboCup 2019: Robot World Cup XXIII, Sydney, Australia, 2–8 July 2019; Chalup, S., Niemueller, T., Suthakorn, J., Williams, M.A., Eds.; Springer: Cham, Switzerland, 2019; pp. 367–375. [Google Scholar]
- V7 Labs. V7 Darwin. 2019. Available online: https://www.v7labs.com/darwin/ (accessed on 9 September 2024).
- Viggle. Available online: https://viggle.ai/ (accessed on 23 August 2024).
- Epic Games. Unreal Engine. 1995. Available online: https://www.unrealengine.com/ (accessed on 9 September 2024).
- Cha, Y.W.; Shaik, H.; Zhang, Q.; Feng, F.; State, A.; Ilie, A.; Fuchs, H. Mobile. Egocentric human body motion reconstruction using only eyeglasses-mounted cameras and a few body-worn inertial sensors. In Proceedings of the 2021 IEEE Virtual Reality and 3D User Interfaces (VR), Lisboa, Portugal, 27 March–1 April 2021; pp. 616–625. [Google Scholar]
- Akada, H.; Wang, J.; Shimada, S.; Takahashi, M.; Theobalt, C.; Golyanik, V. Unrealego: A new dataset for robust egocentric 3d human motion capture. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 1–17. [Google Scholar]
- Adobe Photoshop Generative Fill. Available online: https://www.adobe.com/products/photoshop/generative-fill.html (accessed on 23 August 2024).
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14. Springer: Cham, Switzerland, 2016; pp. 483–499. [Google Scholar]
- Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
- Lovanshi, M.; Tiwari, V. Human pose estimation: Benchmarking deep learning-based methods. In Proceedings of the 2022 IEEE Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI), Gwalior, India, 21–23 December 2022; pp. 1–6. [Google Scholar]
- Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. Numerical coordinate regression with convolutional neural networks. arXiv 2018, arXiv:1801.07372. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Wang, J.; Perez, L. The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Netw. Vis. Recognit. 2017, 11, 1–8. [Google Scholar]
- Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48. [Google Scholar] [CrossRef]
- Yang, S.; Xiao, W.; Zhang, M.; Guo, S.; Zhao, J.; Shen, F. Image data augmentation for deep learning: A survey. arXiv 2022, arXiv:2204.08610. [Google Scholar]
- Pytorch. Available online: https://pytorch.org (accessed on 12 September 2024).
- Open Neural Network Exchange. Available online: https://onnx.ai (accessed on 12 September 2024).
Method | Keypoints | Target Joints | Target Robots
---|---|---|---
(a) Robot Arm Pose Estimation [5,6,7] | | Arm Joints | Less than 3
(b) Humanoid Robot Pose Estimation [8,9] | | Partial-Body Joints | Less than 3
(c) Diverse Humanoid Robot Pose Estimation (Ours) | | Full-Body Joints | 11 commercial robots
Humanoid Robot | Train Size | Test Size |
---|---|---|
Apollo (Apptronik) [13] | 530 | 118 |
Atlas (Boston Dynamics) [14] | 337 | 141 |
DARwIn-OP (Robotis) [15] | 348 | 134 |
EVE (1X Technologies) [16] | 512 | 100 |
FIGURE 01 (Figure) [17] | 261 | 146 |
H1 (Unitree) [18] | 200 | 110 |
Kepler (Kepler Exploration Robot) [19] | 332 | 136 |
Optimus Gen 2 (Tesla) [12] | 474 | 135 |
Phoenix (Sanctuary AI) [20] | 233 | 168 |
TALOS (PAL Robotics) [21] | 358 | 108 |
Toro (DLR) [22] | 233 | 158 |
Total | 3818 | 1454 |
Train Set | Size | Test Set | Size |
---|---|---|---|
Real (11 target humanoid robots) | 3818 | Real (11 target humanoid robots) | 1454 |
Real (arbitrary humanoid robots) | 2027 | - | - |
Synthetic dataset | 6733 | - | - |
Random Backgrounds | 2019 | - | - |
Total | 14,597 | Total | 1454 |
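The dataset tables above report only image counts; each training image additionally carries 2D joint annotations. The snippet below sketches a plausible COCO-style record for the joint set used in the per-joint evaluations (nose, neck, shoulders, elbows, wrists, hips, knees, ankles); the field names, joint ordering, and coordinate values are hypothetical and do not reflect the published annotation schema.

```python
# Hypothetical DHRP-style annotation for one image (COCO keypoint layout:
# x, y, visibility per joint; v=0 unlabeled, v=1 occluded, v=2 visible).
JOINT_NAMES = [
    "nose", "neck",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

example_annotation = {
    "image": "atlas_000137.jpg",      # placeholder file name
    "robot": "Atlas",
    "bbox": [412, 96, 238, 540],      # x, y, width, height in pixels
    "keypoints": [                    # up to 3 * len(JOINT_NAMES) values
        655, 142, 2,   648, 198, 2,   # nose, neck
        701, 210, 2,   596, 207, 1,   # shoulders
        # ... remaining joints omitted for brevity
    ],
}
```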
Augment. Type | Augment. Method | Prob. | Range |
---|---|---|---|
Motion jitter | Image scale | 0.8 | 0.5–1.6 |
Motion jitter | Image rotation | 0.8 | ≤90° |
Motion jitter | Image translation | 0.8 | ≤0.3 × image height |
Motion jitter | Horizontal flip | 0.5 | - |
Color jitter | Pixel contrast | 0.8 | ≤×0.2 |
Color jitter | Pixel brightness | 0.8 | ≤±30 |
Noise jitter | Gaussian blur | 0.4 | |
Noise jitter | Salt-and-pepper noise | 0.3 | ≤±25 |
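The augmentation settings above combine geometric (motion) jitter, color jitter, and noise jitter. The sketch below shows how such a pipeline could be applied jointly to an image and its keypoint annotations using plain OpenCV and NumPy; the probabilities and ranges follow the table, while the exact parameter interpretation (e.g., the salt-and-pepper range) and the omitted left/right joint swap after flipping are simplifications rather than the authors' training code.

```python
import random
import numpy as np
import cv2

def augment(image: np.ndarray, keypoints: np.ndarray):
    """Apply DHRP-style jitter to an HxWx3 image and (J, 2) float keypoints."""
    h, w = image.shape[:2]

    # Motion jitter: random scale (0.5-1.6), rotation (<=90 deg),
    # translation (<=0.3 * image height), each applied with probability 0.8.
    if random.random() < 0.8:
        scale = random.uniform(0.5, 1.6)
        angle = random.uniform(-90, 90)
        tx = random.uniform(-0.3, 0.3) * h
        ty = random.uniform(-0.3, 0.3) * h
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
        M[:, 2] += (tx, ty)
        image = cv2.warpAffine(image, M, (w, h))
        keypoints = (keypoints @ M[:, :2].T) + M[:, 2]

    # Horizontal flip with probability 0.5 (left/right joint indices
    # should also be swapped, which is omitted here for brevity).
    if random.random() < 0.5:
        image = image[:, ::-1].copy()
        keypoints = keypoints.copy()
        keypoints[:, 0] = (w - 1) - keypoints[:, 0]

    # Color jitter: contrast (<= x0.2) and brightness (<= +/-30), p = 0.8 each.
    if random.random() < 0.8:
        alpha = 1.0 + random.uniform(-0.2, 0.2)
        image = np.clip(image.astype(np.float32) * alpha, 0, 255).astype(np.uint8)
    if random.random() < 0.8:
        beta = random.uniform(-30, 30)
        image = np.clip(image.astype(np.float32) + beta, 0, 255).astype(np.uint8)

    # Noise jitter: Gaussian blur (p = 0.4) and salt-and-pepper noise (p = 0.3).
    if random.random() < 0.4:
        image = cv2.GaussianBlur(image, (5, 5), 0)
    if random.random() < 0.3:
        mask = np.random.random(image.shape[:2])
        image[mask < 0.01] = 0
        image[mask > 0.99] = 255

    return image, keypoints
```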
Configuration | Model | ||||||
---|---|---|---|---|---|---|---|
Dataset A | 4-stage HG | 73.9 | 81.9 | 61.6 | 51.7 | 50.8 | 31.8 |
Dataset B | 4-stage HG | 80.9 | 89.9 | 74.0 | 68.4 | 73.2 | 51.4 |
Dataset C | 4-stage HG | 83.5 | 91.7 | 79.3 | 73.7 | 79.6 | 60.7 |
Dataset D | 4-stage HG | 84.9 | 93.2 | 82.1 | 75.1 | 80.6 | 64.5 |
Configuration | Model | Nose | Neck | Sho | Elb | Wri | Hip | Knee | Ank | Mean |
---|---|---|---|---|---|---|---|---|---|---|
Dataset A | 4-stage HG | 78.6 | 79.1 | 73.9 | 61.4 | 45.0 | 73.3 | 77.3 | 69.3 | 69.7 |
Dataset B | 4-stage HG | 87.0 | 88.8 | 85.4 | 70.4 | 63.7 | 84.1 | 80.6 | 77.9 | 79.7 |
Dataset C | 4-stage HG | 89.6 | 88.9 | 87.7 | 76.3 | 67.4 | 88.4 | 85.9 | 80.8 | 83.1 |
Dataset D | 4-stage HG | 89.9 | 89.9 | 88.2 | 76.8 | 70.3 | 89.0 | 85.0 | 80.5 | 83.7 |
Configuration | Model | ||||||
---|---|---|---|---|---|---|---|
Train A | 4-stage HG | 64.7 | 72.3 | 46.0 | 27.3 | 23.7 | 13.3 |
Train B | 4-stage HG | 82.6 | 90.9 | 77.7 | 65.4 | 69.9 | 47.6 |
Train C | 4-stage HG | 84.9 | 93.2 | 82.1 | 75.1 | 80.6 | 64.5 |
Configuration | Total | Apo. | Atl. | Dar. | Eve | Fig. | H1 | Kep. | Opt. | Pho. | Tal. | Tor. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Individual | 51.0 | 48.8 | 57.8 | 61.3 | 38.4 | 27.8 | 49.9 | 90.8 | 50.9 | 34.3 | 42.6 | 58.4 |
Dataset S | 58.5 | 79.2 | 56.9 | 54.5 | 65.9 | 32.2 | 72.1 | 79.0 | 73.6 | 51.3 | 29.9 | 55.0 |
Dataset A | 73.9 | 71.4 | 79.9 | 66.9 | 75.3 | 49.4 | 85.1 | 90.6 | 64.4 | 75.0 | 81.6 | 77.5 |
Dataset D | 84.9 | 85.9 | 88.1 | 79.7 | 82.4 | 69.7 | 95.4 | 94.3 | 80.0 | 83.5 | 91.8 | 87.2 |
Leave-One-Out | Total | Apo. | Atl. | Dar. | Eve | Fig. | H1 | Kep. | Opt. | Pho. | Tal. | Tor. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Using All | 75.1 | 74.5 | 80.6 | 73.9 | 72.3 | 63.0 | 89.9 | 90.8 | 67.7 | 62.0 | 81.2 | 77.5 |
w/o Apo. | 65.6 | 52.5 | 74.8 | 65.9 | 62.4 | 55.3 | 83.7 | 83.2 | 61.3 | 48.3 | 66.3 | 71.9 |
w/o Atl. | 62.1 | 62.0 | 38.0 | 67.1 | 69.1 | 50.7 | 84.0 | 84.3 | 56.3 | 43.5 | 67.3 | 72.5 |
w/o Dar. | 65.0 | 63.6 | 75.6 | 47.5 | 61.4 | 54.8 | 82.5 | 82.6 | 59.5 | 46.0 | 67.9 | 78.8 |
w/o Eve | 64.8 | 59.2 | 76.7 | 65.8 | 29.5 | 56.4 | 85.3 | 84.9 | 59.0 | 46.8 | 74.3 | 73.4 |
w/o Fig. | 65.9 | 60.3 | 73.8 | 68.4 | 70.1 | 48.5 | 86.2 | 82.4 | 63.6 | 47.2 | 64.0 | 69.8 |
w/o H1 | 66.5 | 61.0 | 75.8 | 69.9 | 63.1 | 60.9 | 51.3 | 84.8 | 63.2 | 50.8 | 75.7 | 74.7 |
w/o Kep. | 62.5 | 62.0 | 74.1 | 61.3 | 65.2 | 54.7 | 80.8 | 63.7 | 55.9 | 44.5 | 56.7 | 74.1 |
w/o Opt. | 66.7 | 62.3 | 72.9 | 67.0 | 64.7 | 57.3 | 83.7 | 83.8 | 56.2 | 49.5 | 66.4 | 74.7 |
w/o Pho. | 64.3 | 60.6 | 73.0 | 63.7 | 66.7 | 62.1 | 83.3 | 85.6 | 60.0 | 22.0 | 73.1 | 71.0 |
w/o Tal. | 62.1 | 66.5 | 74.3 | 66.4 | 54.8 | 55.0 | 82.2 | 85.8 | 61.9 | 43.6 | 18.1 | 71.1 |
w/o Tor. | 63.9 | 59.4 | 76.6 | 65.9 | 64.0 | 59.4 | 84.6 | 87.4 | 58.1 | 48.3 | 73.1 | 39.1 |
Model | ||||||
---|---|---|---|---|---|---|
4-stage hourglass | 84.9 | 93.2 | 82.1 | 75.1 | 80.6 | 64.5 |
5-stage hourglass | 86.7 | 94.3 | 84.5 | 79.2 | 86.3 | 70.4 |
6-stage hourglass | 86.4 | 93.9 | 83.9 | 79.8 | 86.9 | 71.7 |
7-stage hourglass | 87.4 | 95.3 | 86.2 | 81.5 | 88.6 | 74.4 |
8-stage hourglass | 83.6 | 92.5 | 78.1 | 75.2 | 81.9 | 62.5 |
Method | Nose | Neck | Sho | Elb | Wri | Hip | Knee | Ank | Mean |
---|---|---|---|---|---|---|---|---|---|
4-stage hourglass | 89.9 | 89.9 | 88.2 | 76.8 | 70.3 | 89.0 | 85.0 | 80.5 | 83.7 |
5-stage hourglass | 90.6 | 90.9 | 90.3 | 79.0 | 73.4 | 91.3 | 87.5 | 83.9 | 85.9 |
6-stage hourglass | 89.4 | 90.2 | 89.5 | 80.7 | 75.5 | 90.2 | 88.0 | 84.4 | 86.0 |
7-stage hourglass | 92.1 | 91.1 | 91.2 | 80.6 | 75.5 | 92.4 | 88.0 | 86.2 | 87.1 |
8-stage hourglass | 88.7 | 88.4 | 88.7 | 78.0 | 69.1 | 87.7 | 86.4 | 83.0 | 83.8 |
Model | Parameters | GFLOPs | Training (h) | Inference (FPS) |
---|---|---|---|---|
4-stage hourglass | 13.0 M | 48.66 | 10.5 | 63.64 |
5-stage hourglass | 16.2 M | 58.76 | 12.8 | 54.56 |
6-stage hourglass | 19.3 M | 68.88 | 14.6 | 48.34 |
7-stage hourglass | 22.4 M | 78.98 | 17.7 | 42.81 |
8-stage hourglass | 25.6 M | 89.10 | 19.8 | 38.34 |
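The parameter counts and inference rates above can be reproduced approximately with standard PyTorch utilities. The sketch below benchmarks a generic pose model at the 320-pixel input resolution used by our detector; it is a simplified stand-in for the actual profiling setup, and GFLOPs estimation is left to an external profiler.

```python
import time
import torch

def benchmark(model: torch.nn.Module, input_size: int = 320, iters: int = 100):
    """Report parameter count and rough inference FPS for a pose model."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(1, 3, input_size, input_size, device=device)

    n_params = sum(p.numel() for p in model.parameters())
    print(f"parameters: {n_params / 1e6:.1f} M")

    with torch.no_grad():
        for _ in range(10):              # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    print(f"inference: {iters / elapsed:.2f} FPS")
    # FLOPs can be estimated with an external profiler such as fvcore's
    # FlopCountAnalysis; it is omitted here to keep the sketch dependency-free.
```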
Method | Input Size | Backbone | Parameters | GFLOPs | Inference (FPS) |
---|---|---|---|---|---|
RoboCup (NimbRo-Net2) [9] | 384 | ResNet18 | 12.8 M | 28.0 | 48 |
Ours (4-stage) | 320 | Hourglass | 13.0 M | 48.66 | 63.64 |
Ours (7-stage) | 320 | Hourglass | 22.4 M | 78.98 | 42.81 |
Method | Nose | Neck | Sho | Elb | Wri | Hip | Knee | Ank | Mean |
---|---|---|---|---|---|---|---|---|---|
RoboCup (NimbRo-Net2) [9] | 49.9 | 64.1 | 49.1 | 25.5 | 10.5 | 39.1 | 23.8 | 30.2 | 36.5 |
Ours (4-stage) | 89.9 | 89.9 | 88.2 | 76.8 | 70.3 | 89.0 | 85.0 | 80.5 | 83.7 |
Ours (7-stage) | 92.1 | 91.1 | 91.2 | 80.6 | 75.5 | 92.4 | 88.0 | 86.2 | 87.1 |
Method | mPCK | PCK@τ1 | PCK@τ2 | PCK@τ3 | PCK@τ4 | PCK@τ5 |
---|---|---|---|---|---|---|
Human+Robot Telepresence [10] | 85.10 | 99.99 | 99.98 | 98.47 | 80.89 | 46.19 |
Ours (7-stage) | 85.76 | 100.0 | 100.0 | 98.73 | 82.56 | 47.52 |
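The comparison above, like the other quantitative tables, reports PCK-style metrics: a predicted joint counts as correct when it lies within a threshold-scaled reference distance of its ground-truth location, and mPCK averages PCK over the evaluated thresholds. The sketch below gives a minimal NumPy implementation; the bounding-box-diagonal normalization and the listed thresholds are illustrative assumptions rather than the paper's exact definition.

```python
import numpy as np

def pck(pred, gt, visible, ref_length, threshold=0.2):
    """Percentage of Correct Keypoints.

    pred, gt: (N, J, 2) predicted / ground-truth keypoints in pixels.
    visible:  (N, J) boolean mask of annotated joints.
    ref_length: (N,) per-image normalization length (here assumed to be the
                robot bounding-box diagonal; other works use torso or head size).
    """
    dist = np.linalg.norm(pred - gt, axis=-1)            # (N, J) pixel errors
    correct = dist <= threshold * ref_length[:, None]    # (N, J) hits
    return 100.0 * correct[visible].mean()

def mpck(pred, gt, visible, ref_length, thresholds=(0.5, 0.4, 0.3, 0.2, 0.1)):
    """Mean PCK over a set of thresholds (threshold values are illustrative)."""
    return float(np.mean([pck(pred, gt, visible, ref_length, t) for t in thresholds]))
```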
Class | Model | ||||||
---|---|---|---|---|---|---|---|
Total | 7-stage HG | 87.4 | 95.3 | 86.2 | 81.5 | 88.6 | 74.4 |
Apollo | 7-stage HG | 90.0 | 97.5 | 92.4 | 83.6 | 89.8 | 79.7 |
Atlas | 7-stage HG | 87.7 | 95.0 | 88.7 | 83.1 | 91.5 | 76.6 |
Darwin Op | 7-stage HG | 84.1 | 87.3 | 81.3 | 82.6 | 86.6 | 80.6 |
Eve | 7-stage HG | 83.6 | 100 | 90.0 | 82.1 | 99.0 | 85.0 |
Figure01 | 7-stage HG | 72.3 | 85.6 | 54.1 | 69.5 | 79.5 | 50.7 |
H1 | 7-stage HG | 95.9 | 99.1 | 98.2 | 91.6 | 97.3 | 94.5 |
Kepler | 7-stage HG | 95.4 | 97.1 | 97.1 | 90.1 | 95.6 | 89.0 |
Optimus Gen2 | 7-stage HG | 86.1 | 92.6 | 82.2 | 76.6 | 82.2 | 67.4 |
Phoenix | 7-stage HG | 86.9 | 99.4 | 84.5 | 72.1 | 76.8 | 51.2 |
Talos | 7-stage HG | 91.1 | 100 | 97.2 | 87.8 | 98.1 | 89.8 |
Toro | 7-stage HG | 91.5 | 98.1 | 91.1 | 84.3 | 88.6 | 72.2 |
Class | Model | Nose | Neck | Sho | Elb | Wri | Hip | Knee | Ank | Mean |
---|---|---|---|---|---|---|---|---|---|---|
Total | 7-stage HG | 92.1 | 91.1 | 91.2 | 80.6 | 75.5 | 92.4 | 88.0 | 86.2 | 87.1 |
Apollo | 7-stage HG | 92.6 | 93.2 | 91.9 | 83.1 | 71.4 | 95.8 | 84.2 | 98.5 | 88.8 |
Atlas | 7-stage HG | 93.8 | 91.8 | 93.0 | 75.6 | 70.9 | 92.6 | 91.4 | 91.4 | 87.6 |
Darwin Op | 7-stage HG | 92.7 | 83.9 | 91.0 | 78.0 | 79.8 | 98.4 | 94.8 | 81.6 | 87.5 |
Eve | 7-stage HG | 95.4 | 93.4 | 93.4 | 83.8 | 68.9 | 79.5 | 78.3 | 74.1 | 83.3 |
Figure01 | 7-stage HG | 92.7 | 89.8 | 88.6 | 61.5 | 53.2 | 82.1 | 71.5 | 69.6 | 76.1 |
H1 | 7-stage HG | 88.8 | 90.9 | 96.1 | 95.0 | 87.4 | 95.4 | 94.9 | 95.5 | 93.0 |
Kepler | 7-stage HG | 94.2 | 94.1 | 91.3 | 93.0 | 89.8 | 97.2 | 84.1 | 99.3 | 92.9 |
Optimus Gen2 | 7-stage HG | 86.9 | 93.1 | 85.4 | 80.9 | 77.7 | 92.6 | 80.0 | 95.3 | 86.5 |
Phoenix | 7-stage HG | 98.5 | 98.5 | 91.9 | 66.3 | 61.5 | 90.5 | N/A | N/A | 84.5 |
Talos | 7-stage HG | 79.4 | 85.2 | 93.2 | 89.4 | 89.2 | 94.9 | 90.6 | 89.8 | 89.0 |
Toro | 7-stage HG | 91.7 | 87.0 | 89.4 | 90.7 | 87.4 | 89.8 | 94.7 | 82.4 | 89.1 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Heo, S.; Cho, Y.; Park, J.; Cho, S.; Tsoy, Z.; Lim, H.; Cha, Y. Diverse Humanoid Robot Pose Estimation from Images Using Only Sparse Datasets. Appl. Sci. 2024, 14, 9042. https://doi.org/10.3390/app14199042