1. Introduction
A digital presence in remote locations facilitates social interaction between geographically separated sites. Telepresence systems eliminate the need for physical travel by utilizing digital avatars as replacements for physical presence. We anticipate that such telepresence systems will gain significant traction and become increasingly commonplace within daily work environments. Furthermore, we expect 3D telepresence systems to reduce travel costs and greatly enhance productivity through remote working in everyday life.
Existing immersive 3D telepresence systems [1,2,3], shown in Table 1a, demonstrate promising results for real-time remote presence by visualizing a digital human avatar of a remote person in 3D, enabling full-body visual interactions between remote individuals. However, despite offering visual immersion of the remote person, such systems lack the capability to facilitate physical interactions with them. This limitation includes the inability to physically manipulate remote objects, such as touching and holding them, thereby restricting the potential benefits of remote working through telepresence.
Mobile Robotic Telepresence (MRP) systems [4,5], shown in Table 1b, offer users mobility in remote sites by enabling remote control of telepresence robots. Users navigate through remote locations using shared 2D camera views from the robot. However, existing commercial telepresence robots, which lack human-like features such as arms, are unable to physically interact with objects in remote sites. This limitation restricts the level of interaction possible between remote individuals.
Prior robot teleoperation systems [6,7,8,9,10], shown in Table 1c, support users in making remote physical contact through the manipulation of humanoid robots. These robots share 2D camera views from remote sites and perform physical actions in the environment based on the user's remote operations. However, existing systems rely on complicated interfaces for the robot controllers and are difficult to apply to different types of robots. Additionally, the robot operation is limited to the area directly in front of the controllers. These features make the system location-restricted and difficult to use conveniently in everyday workplaces.
Table 1. Comparison of remote interactions between telepresence systems.
In this paper, we introduce a novel real-time 3D telepresence system designed for both physical interactions in remote locations and immersive 3D visual interactions between remote individuals, shown in Table 1d. The remote user is represented in two forms: a motion-retargeted humanoid robot and a digital human overlaid on the robot in a separate space. Equipped with our inertial sensor-based motion capture system, the remote user can conveniently control the humanoid robot and interact with remote objects through it. With the body-worn motion capture system, users can wirelessly operate the robot from any location, eliminating the need for complex robot control systems. An overview of our digital human-augmented robotic telepresence system is provided in Figure 1.
Rather than encountering a metallic robot that lacks realism, our system enhances the sense of reality by overlaying a digital human on the robot. The local user engages with the remote person’s digital avatar through a head-mounted display (HMD), facilitating immersive 3D visual interactions.
A key challenge in overlaying a digital human onto a humanoid robot is consistently estimating the robot’s full-body pose over time from the perspective of a moving person. Errors can arise from various factors, such as Visual SLAM (Simultaneous Localization and Mapping) inaccuracies in tracking the user’s head, occlusions of body parts, or parts of the robot being outside the camera’s view.
Visual SLAM is employed to track the user’s headset relative to the world space, with digital human augmentation occurring in the coordinate frame of the moving headset. While external knowledge of the target robot’s full-body pose can be provided, inaccuracies in the digital human overlay tend to worsen with increased head movement. Consequently, the accuracy of the digital human overlay is heavily dependent on the performance of the Visual SLAM system, leading to greater discrepancies during rapid head motions.
Previous approaches (e.g., [12,13]) rely on marker-based pose estimation to align the digital human avatar with the robot. However, these methods require robot-specific marker configurations and precise calibration, posing challenges when applied to different types of robots.
Our approach addresses the challenges posed by headset tracking errors by directly estimating the robot’s pose from images captured by moving head-worn cameras. This method allows for pose estimation that operates independently of runtime headset tracking, enabling our system to function without relying on external knowledge of the robot’s global pose in arbitrary environments.
Additionally, we employ a learning-based method for full-body pose estimation in humanoid robots that does not require markers. This method maintains the visual fidelity of the robot and can be adapted to any human-like robot, provided the appropriate dataset is available. As a result, it eliminates the need for marker calibration and mitigates issues related to occlusions or invisible markers.
We demonstrate our system in a remotely assisted physical training scenario, showcasing both immersive and physical telepresence capabilities. To support training and evaluation, we collected a new large-scale dataset for humanoid robot pose estimation. To the best of our knowledge, no existing dataset includes full-body pose annotations for humanoid robots [14]. We plan to make our dataset publicly available to contribute to the research community. In our experiments, our learning-based pose estimation runs in real time at 40 Hz on a standard PC, enabling marker-free alignment of digital humans and supporting immersive digital human-augmented robotic telepresence using only head-worn cameras in practical applications.
Our main contributions are summarized as follows:
A working, proof-of-concept prototype of a telepresence system that supports immersive 3D visual and physical interactions simultaneously between remote individuals.
A learning-based, marker-free, full-body pose estimation approach for humanoid robots using only moving head-worn cameras.
2. Related Work
Our research focuses on the intersection of three key areas: human avatars, telepresence, and humanoid robot pose estimation. Numerous studies have explored these areas, paving the way for advancements in our current research. This section provides a brief overview of relevant work, starting with foundational concepts and progressing towards recent developments.
2.1. Human Avatars and Telepresence
Human avatars play an increasingly prominent role in Virtual Reality (VR), telepresence, and human–computer interaction (HCI) applications [15,16,17,18]. Recent research emphasizes creating high-fidelity avatars [19,20] with enhanced realism and expressiveness for user control [21,22,23]. These advancements directly contribute to the goal of telepresence systems, which strive to create a strong sense of presence for the remote user [1,24,25,26,27,28,29]. Notable examples include Telesarphone, a telexistence communication system using retroreflective projection technology [27], and TELESAR V, which facilitates transferring bodily consciousness in telexistence [28]. Similarly, the Beaming system explores asymmetric telepresence, further enhancing user experience in remote environments [29]. Advancements in Augmented Reality (AR), Mixed Reality (MR), and Virtual Reality (VR) technologies further enhance the potential of telepresence systems. Studies like Holoportation [1] and ViGather [26] demonstrate the feasibility of utilizing these advancements to create realistic and immersive telepresence experiences.
Telepresence systems facilitate social interaction between remote individuals by providing virtual human avatars through head-worn displays, representing visual illusions of remote persons. This technology has applications in various fields, including remote consultations and conferences. Research efforts have explored the use of telepresence in these contexts, as evidenced by studies on teleconsultation [24], virtual conferences [25,26], clinical care [30], and telexistence communication [27,28].
However, virtual avatar-based telepresence systems often lack realism without supporting physical interactions. Building on this research, our system leverages XR technologies to enhance the sense of realism in telepresence, offering both visual and physical interactions simultaneously for a more immersive and engaging user experience.
2.2. Robotic Telepresence Systems
Studies on robotic telepresence systems represent an active area of research within the broader field of telepresence. Specifically, Mobile Robotic Telepresence (MRP) systems aim to enhance social interactions among individuals through the use of mobile robots, and they are gaining increasing traction in healthcare applications [4]. These telerobots allow users to navigate remote locations using a robot-mounted camera in real time, and previous studies indicate that such systems can improve the quality of life for individuals with special needs [5]. Early approaches focused on equipping robots with screens displaying the remote user's face, essentially enabling a robotic video call [31]. However, recent advancements have delved into more sophisticated interaction capabilities.
One such advancement involves enabling physical interactions between remote individuals through haptic devices. This technology allows users to experience the forces and sensations perceived by the robot, thereby enhancing the sense of immersion and realism. Studies by Schwarz et al. [32], Luo et al. [8], and Lenz et al. [33] explore the integration of haptic feedback into telepresence systems, demonstrating its potential to create more realistic and interactive user experiences.
The integration of the remote user's 3D avatar with the robot's physical structure enhances telepresence by providing a more lifelike representation of the remote user, thereby improving the sense of presence. Tejwani et al. [12] introduced an Avatar Robot System that overlays a 3D human model onto an upper-body robot through Augmented Reality (AR), leading to improved realism during robot interactions. Similarly, Jones et al. [13] overlaid a full-body human avatar onto a telerobot using checkerboard-based pose estimation, which improved user experiences in telepresence interactions. Both studies highlight the significant potential of digital human augmentation to enhance interaction and user satisfaction during telepresence.
Our research approach builds on these advancements by visualizing a digital avatar on a humanoid robot through head-worn displays during robot teleoperation. This integration not only enables the remote user to control the robot’s actions but also facilitates immersive visual interaction for the local user via the digital human-augmented humanoid robot. The combination of visual and physical interactions enhances the system’s realism, allowing the local user to feel more connected to the remote user’s actions and responses. This approach has not been thoroughly explored in previous studies on robotic telepresence, presenting a unique avenue for investigation.
2.3. Human Pose Estimation
Human pose estimation is closely related to humanoid robot pose estimation due to the similarities in joint structures and the shared goal of accurate pose estimation. Advances in deep learning have led to substantial improvements in human pose estimation [2,3,34,35,36,37,38,39]. For example, Insafutdinov et al. [35] and Cao et al. [36] significantly improved real-time multi-person 2D pose estimation using Convolutional Neural Networks (CNNs).
Recent approaches in 3D human pose estimation have demonstrated further advances by utilizing parametric body models [38,40], such as SMPL [41]. By fitting these body models onto images, anatomical constraints can be applied, leading to significantly improved accuracy compared to methods based on individual joint detection. However, inferring 3D poses from 2D representations introduces inherent pose ambiguities, which pose challenges in estimating diverse poses from images, particularly in cases of side views or self-occlusions.
To mitigate visibility issues, body-worn sensors such as Inertial Measurement Units (IMUs) have been introduced for pose estimation. These inertial sensors can measure the orientations of body segments regardless of camera visibility. von Marcard et al. [42] achieved promising results in full-body 3D pose estimation using only sparse inertial sensors. Huang et al. [43] extended this approach to real time by utilizing Recurrent Neural Networks (RNNs). However, relying solely on sparse sensors can lead to issues such as IMU drift and pose ambiguities.
Combining visual and inertial sensors can improve pose estimation, even in side-view and occlusion scenarios [44]. However, requiring numerous sensors can be burdensome for users and impractical for everyday applications.
Learning-based human pose estimation methods require large-scale pose datasets for training [45,46]. However, directly applying these techniques to humanoid robots presents challenges due to the lack of large-scale datasets for humanoid robots. Additionally, adaptations are needed to effectively apply learning-based pose estimation to humanoid robots.
2.4. Humanoid Robot Pose Estimation
Due to the distinct characteristics of humans and humanoid robots, directly applying human pose estimation methods to humanoids is impractical: outfitting robots with wearable sensors and integrating them into a network would impose additional burdens in typical applications. Our approach instead addresses cases where the robot's appearance may vary, focusing on general applicability as long as the robot's topology—such as one head, two arms, two legs, and a body with roughly human proportions—resembles that of a human being.
Miseikis et al. [47] presented a method utilizing convolutional neural networks (CNNs) to detect robot joints. Lee et al. [48] extracted robot keypoints from a single image without the need for markers or offline calibration. Lu et al. [49] proposed an algorithm for automatically determining the optimal keypoints for various parts of a robot. Traditional robot pose estimation, however, often focuses on robotic arms rather than full humanoid forms. Furthermore, since robot joints differ from human joints, conventional methods of converting robot poses into human poses are not feasible.
Given the differences between humans and humanoid robots, training models for humanoid robot pose estimation requires a dataset different from those used for human pose estimation. However, due to the scarcity of such datasets, we have developed our own dataset for training purposes. This initiative aims to overcome the limitations of existing research and pave the way for more accurate and adaptable humanoid pose estimation techniques.
3. Digital Human-Augmented Telepresence System
To achieve convenient immersive remote working, our goal is to develop a telepresence system that supports both physical and digital avatars simultaneously. The physical avatar is realized through a humanoid robot operated by a remote user, enabling physical interactions in remote places. The digital avatar is represented as a digital human, providing visual immersion of the remote person by overlaying the digital human onto the robot. An overview of our humanoid robot-aided 3D telepresence system is shown in Figure 2.
Our prototype system comprises two main components located in separate spaces. A user in a remote location, as shown on the left in Figure 2, can perform tasks in the local space through a humanoid robot. A local user, as shown on the right in Figure 2, can physically interact with this robot operated by the remote user and observe the immersive visual avatar of the remote person through a head-mounted display (HMD). Both users are connected wirelessly to a single machine that transmits full-body poses in real time.
At the remote site, the user wears inertial sensors for motion capture and an HMD for visual communication with the local person. The captured motion of the remote user is transmitted wirelessly to both the humanoid robot and the machine for pose estimation of the digital avatar in the target space. The humanoid robot and the digital avatar are deformed based on the received motion through forward kinematics.
The local user is equipped with an HMD for visual communication, augmented with a stereo camera rig to observe the local space. Given the stereo rig imagery, our proposed learning-based robot pose estimation approach estimates the 3D pose of the observed robot in the local space. The estimation is wirelessly sent to the XR headset of the local user, and the HMD visualizes the digital avatar of the remote person in the estimated position and pose so that the visual avatar is overlaid on the humanoid robot from the local user's perspective for immersive 3D visual interactions. The details of the learning-based humanoid robot pose estimation method are described in Section 4. Additionally, the local user can physically interact with the remote person through the humanoid robot avatar.
3.1. Humanoid Robot
In the proposed telepresence system, the humanoid robot is utilized as a physical human avatar of the remote person for remote physical interactions. To achieve our objective, we implemented humanoid robots, categorized into three versions, as shown in Figure 3.
The first version of the dual-arm robot, shown in Figure 3a, is equipped with 14 degrees of freedom (DoF) for both arms, including 3D shoulders, 1D elbows, and 3D wrists, along with a camera mounted on the head in a fixed location. It employs an EtherCAT network with a 1 kHz communication cycle.
The second version of the humanoid robot, shown in Figure 3b, features a total of 19 DoF. It includes 8 DoF for both arms, comprising 3D shoulders and 1D elbows, and 10 DoF for both legs, consisting of 2D hips, 1D knees, wheels, and anchor supports. Additionally, it has 1 DoF in the waist and a camera on the head. This version uses EtherCAT with a 2 kHz communication cycle to improve stability in motion control.
The third version of the humanoid robot, shown in Figure 3c, is based on the same specifications as the version in Figure 3b, with the addition of the qb SoftHand [50] for both hands, each equipped with 1 DoF, enabling grasping motions to manipulate objects. The SoftHand employs RS485 network communication.
The first two versions of the humanoid robots are used solely for generating the Humanoid Robot Pose dataset, as described in Section 4.1. The third version is used for the demonstration of our remote physical therapy scenario and for the training and evaluation of the robot pose estimation method.
The current version of the robot is in a fixed location; however, we plan to enable mobility in the next version to showcase remote working in various outdoor scenarios.
3.2. Digital Avatar
In our telepresence system, the digital human model acts as a visual human avatar for the visual immersion of the remote user, enhancing the realism of remote work between separate locations. An example of the digital human model is shown in Figure 4.
The remote user's appearance is prescanned as a digital 3D model in our capture studio, which is equipped with tens of high-resolution cameras [2,3]. The prescanned T-posed 3D human model is transformed into an animatable full-body human model by manually adding rigging information using Blender software [51].
To accurately overlay the digital human model onto the humanoid robot, we adjust for differences in shape and height between the user and the humanoid robot. Using the Unified Robot Description Format (URDF) of our humanoid robot, we manually rescaled the prescanned rigged human model, as shown in Figure 4, improving the 3D alignment between the humanoid robot and the digital human model in the XR headset.
3.3. Robot Teleoperation in a Remote Space
To enable robot teleoperation, a remote user wears inertial sensors to control the robot and a head-mounted display (HMD) for video conferencing, as shown in Figure 5a. By using body-worn inertial sensors, our system avoids the need for complex interfaces typically required in existing teleoperation systems [6,7,8,9,10].
The remote user wears six inertial sensors (Xsens MTw Awinda [54]) on the upper arms, lower arms, and hands to capture upper-body motion. These measurements are wirelessly transmitted to a server, which relays the data to the humanoid robot in the local space. The robot continuously adapts its arm poses based on the remote user's movements.
The remote operator's pose is represented by the joint rotations of both arms, R_t, and a binary grab motion, g_t, for each hand. The body-worn inertial sensors are calibrated in the initial frames using the methods described in [3,43]: at the beginning, both the remote user and the humanoid robot hold a straight pose for a few seconds so that the inertial measurements are set to identity. At time t, the arm rotations R_t are then obtained from the calibrated inertial sensors.
During the calibration process, the sensors on the hands are also set to identity, with the assumption that both the user and the robot have their hands open. The binary grab motion indicator g_t is triggered when the relative rotation between the hand and lower-arm sensors exceeds a preset threshold, and the robot then opens or closes its hand based on this grab motion.
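As an illustration, the following Python sketch shows one way such a grab indicator can be derived from two calibrated IMU orientations; the quaternion convention, helper names, and the threshold value are assumptions for this example rather than the exact values used in our prototype.

```python
# Minimal sketch of the binary grab detection described above, assuming each
# calibrated IMU delivers its orientation as a quaternion (x, y, z, w).
# GRAB_ANGLE_DEG is an illustrative threshold, not the value used in the paper.
import numpy as np
from scipy.spatial.transform import Rotation as R

GRAB_ANGLE_DEG = 35.0  # hypothetical trigger threshold


def grab_indicator(hand_quat, forearm_quat, ref_hand_quat, ref_forearm_quat):
    """Return 1 (grab) if the hand has rotated relative to the lower arm
    beyond the threshold, compared with the open-hand calibration pose."""
    # Relative rotation of the hand w.r.t. the forearm at the current frame ...
    rel_now = R.from_quat(forearm_quat).inv() * R.from_quat(hand_quat)
    # ... and at calibration time (open hand, set to identity in practice).
    rel_ref = R.from_quat(ref_forearm_quat).inv() * R.from_quat(ref_hand_quat)
    # Angle of the residual rotation between the two relative poses.
    angle_deg = np.degrees((rel_ref.inv() * rel_now).magnitude())
    return int(angle_deg > GRAB_ANGLE_DEG)


if __name__ == "__main__":
    identity = np.array([0.0, 0.0, 0.0, 1.0])
    bent_hand = R.from_euler("x", 50, degrees=True).as_quat()
    print(grab_indicator(bent_hand, identity, identity, identity))  # -> 1
```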
For immersive visual interaction with the local environment, the remote user wears an Apple Vision Pro [52] for video conferencing. A GoPro camera [55] mounted on the humanoid robot's head shares its field of view with the remote user. Using Zoom [56], the remote user can observe the local space from the robot's perspective and engage in real-time conversations with the local individual. In this prototype, the wearable inertial sensors, the HMD, and the humanoid robot are wirelessly connected, making the system potentially operable in mobile scenarios.
3.4. Digital Human-Augmented Telepresence in a Local Space
To demonstrate the proof-of-concept prototype for digital human-augmented robotic telepresence, the local user wears camera sensors and an HMD for 3D telepresence observation, as shown in Figure 5b. Through this prototype, the local user can observe the digital human overlay representing the remote person while physically interacting with the humanoid robot.
The local user wears an XR headset (XReal Light [53]) equipped with a stereo camera rig for immersive telepresence interactions with the remote user. The stereo rig consists of synchronized miniature stereo cameras (160° FoV, 60 Hz) that provide a wide field of view, enabling full-body visibility of the robot even at close distances, while the high frame rate minimizes motion blur during fast movements. This setup enables immersive telepresence, allowing the user to maintain visual contact with the digital human model overlaid onto the humanoid robot, even during close interactions such as shaking hands (please refer to the Supplemental Video).
The stereo cameras are rigidly mounted on the XR headset, and their relative poses are precalibrated using a standard checkerboard calibration [57]. During operation, the stereo camera poses are updated based on the headset's motion, which is tracked using the built-in Visual SLAM system.
Using the stereo rig's observations, the 3D pose of the humanoid robot is estimated from the headset's perspective. Detailed humanoid robot pose estimation is covered in Section 4. The digital human model is then overlaid in the XR headset, aligned with the humanoid robot's estimated pose for real-time augmented telepresence.
Currently, the stereo rig is connected to the local server via a wired connection. In future iterations, we plan to implement a wireless connection for the stereo rig, similar to other wearable sensors, to ensure the system remains operable in mobile scenarios.
4. Humanoid Robot Pose Estimation Using Head-Worn Cameras
Working toward our goal of simultaneously supporting the immersive visual avatar and the physical avatar, our system overlays the remote user's digital human model onto the humanoid robot from the local user's perspective. For widespread usability, we propose a learning-based, markerless approach for 3D pose estimation of humanoid robots. Our learning-based method eliminates the need for attaching markers or performing calibration between them. Additionally, estimating from the moving user's perspective removes the requirement for calibration between the robot's global space and the observer's local space. As a result, our approach can be easily extended to other humanoid robots with appropriate pose datasets, and to outdoor, mobile environments. The full-body pose estimation pipeline for 2D and 3D is illustrated in Figure 6 and Figure 7, respectively.
Our method estimates the locations of joints on images, similar to learning-based human pose estimation approaches [58]. We collected a full-body humanoid robot dataset to train the regression-based convolutional neural network (CNN) [59]. To the best of our knowledge, this is the first dataset for full-body humanoid robots. The details of the dataset and 2D joint detection are described in Section 4.1 and Section 4.2, respectively.
The stereo cameras are calibrated with respect to the headset, whose pose is continuously updated via the built-in Visual SLAM. Using the tracked stereo camera poses, the locations of the 3D joints in global space are estimated by triangulating the detected 2D joints from the stereo rig. The details of the 3D pose estimation are described in Section 4.3.
Vision-based 3D pose estimation is not successful when parts of the humanoid robot are self-occluded or out of the image during close interactions between the user and the robot. Additionally, the estimated headset pose from the built-in Visual SLAM is inaccurate during fast head motions. To compensate for these error sources, we further update the 3D pose of the robot using measurements from the sparse inertial sensors worn by the remote person (only for arms). The inertial sensor measurements can be continuously captured even if body parts are occluded or out of the image. However, the sensor measurements are prone to drift over time, so we update the calibration in real time via the visual–inertial alignment described in Section 4.3. The IMU measurements are used for pose estimation in situations where body parts are unobserved.
The resulting estimated 3D pose of the robot in world space is continuously transmitted to the headset. The digital human model is deformed using this pose and is displayed immersively on the stereo headset display. This allows the local user to see the digital human avatar of the remote user overlaid on the humanoid robot.
4.1. Humanoid Robot Pose Dataset
For widespread acceptability, we decided to exploit markerless, learning-based pose estimation for humanoid robots with our prototype. However, none of the available humanoid robot datasets support full-body pose annotations [14], unlike human pose datasets [45]. Therefore, we collected a new Humanoid Robot Pose dataset for full-body support using the target humanoid robots shown in Figure 3.
We recorded various types of motions from different viewpoints using multiple cameras. For the training dataset, we collected 14 sequences for Robots V1, V2, and V3, as shown in Figure 3, using ten synchronized fixed cameras and three moving cameras. Each image sequence was uniformly subsampled for annotation. The joint locations of the full body were manually annotated using the annotation software [60].
The joint structure includes the nose, neck, shoulders, elbows, wrists, hips, knees, and ankles, for a total of 14 joints. The resulting training dataset of 15 k images is summarized in Table 2, which includes the dataset size for each robot type. Random background images, without any objects, were also used during training to determine the best configuration of the dataset.
For evaluation, we recorded two sequences using four multiview fixed cameras alongside our moving stereo camera rig, all synchronized in frame capture. The multiview fixed cameras were calibrated to establish a global coordinate system, and the 2D joint locations extracted from their images were used to generate ground-truth 3D joint sets. The camera poses of the stereo cameras, along with their images, served as input for estimating the 3D pose. The estimated 3D pose was then evaluated against the ground-truth 3D joint sets to assess performance.
Manually annotating 2D keypoints in all images is time-consuming, and the human-annotated locations may not be multiview consistent in 3D global space. Therefore, we generated the ground-truth data semi-automatically by training the pose detector separately, using both the training dataset and automatically generated data samples as follows.
Using the method described in Section 4.2, the initial pose detector was trained with the training dataset only. For the images from the four fixed cameras, the initial pose detector was utilized to extract raw 2D joint locations. Using the precalibration of the fixed cameras, initial 3D joint locations were extracted by triangulation from the four viewpoints. By projecting the 3D joints in global space to each fixed camera, globally consistent 2D joint locations were generated. Incorporating these generated samples with the training dataset, we retrained the pose detector to fit the newly added samples. We reprocessed the back-projection using the improved detector and generated more improved samples on the images from the fixed cameras. In the experiment, two rounds of training were sufficient to extract globally and temporally consistent 3D joint sets for evaluation.
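The following Python sketch outlines this bootstrapping loop. The training, detection, triangulation, and projection routines are passed in as placeholder callables standing for the components described in Section 4.2 and Section 4.3, so the snippet illustrates the procedure rather than our exact implementation.

```python
# Sketch of the semi-automatic ground-truth generation loop described above.
# The detector training, 2D detection, triangulation, and projection routines
# are injected as callables; they stand in for the real pipeline components.
def bootstrap_ground_truth(manual_samples, fixed_cam_images, train_fn,
                           detect_fn, triangulate_fn, project_fn, rounds=2):
    """fixed_cam_images: {cam_id: [image, ...]} from the calibrated fixed cameras."""
    detector = train_fn(manual_samples)          # round 0: manual labels only
    generated, joints_3d = [], []
    for _ in range(rounds):
        # 1) Detect raw 2D joints in every fixed-camera image.
        raw_2d = {cam: [detect_fn(detector, img) for img in imgs]
                  for cam, imgs in fixed_cam_images.items()}
        # 2) Triangulate per-frame 3D joints from the calibrated views.
        joints_3d = triangulate_fn(raw_2d)
        # 3) Reproject to every camera for multiview-consistent 2D labels.
        generated = [(cam, f, project_fn(joints_3d[f], cam))
                     for cam in fixed_cam_images
                     for f in range(len(joints_3d))]
        # 4) Retrain on manual + generated samples and repeat.
        detector = train_fn(manual_samples + generated)
    return detector, joints_3d
```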
Finally, the pose of the moving stereo camera rig was estimated using stereo Visual SLAM over time [61]. We evaluated the proposed method using the registered stereo images with the globally consistent 3D joint dataset. The resulting 2.8 k evaluation dataset is summarized in Table 3. Sequence 1 includes minimal camera movements, and Sequence 2 incorporates fast camera movements to measure performance in different scenarios.
4.2. CNN Joint Detector
To achieve our goal of markerless full-body pose estimation for humanoid robots, we utilize a convolutional neural network (CNN) trained on our collected dataset, as described in Section 4.1. Specifically, we employ a regression-based CNN [59] paired with a stacked hourglass backbone [58] for our joint detector.
The stacked hourglass architecture is highly regarded in the domain of human pose estimation, exhibiting performance metrics that are competitive with those of leading methodologies such as OpenPose [36] and ViTPose [62]. Notably, it has achieved state-of-the-art results on the MPII dataset [46], as highlighted in [63].
To enhance computational efficiency, we adopted the DSNT regression module, which allows the network to directly estimate 2D keypoint coordinates. This approach removes the necessity of parsing heatmaps for coordinate extraction during runtime. Additionally, we train the network similarly to the method described in [3].
4.2.1. Network Structure
The joint detection network takes a single image I as input and outputs the joint coordinates K along with their confidence maps H for J = 14 keypoints. We use four stacked stages to balance accuracy and speed. The structure of the multistage network is illustrated in Figure 6.
Figure 6. Network structure for the 2D joint detector: Given a single input image, the 4-stage network outputs the keypoint coordinates K and their confidence maps H. The Hourglass module [58] outputs unnormalized heatmaps H̃, which are propagated to the next stage. The DSNT regression module [59] normalizes H̃ and produces H and K.
In each stage, the Hourglass module [58] infers the unnormalized heatmaps H̃ for all joints. The DSNT regression module [59] normalizes H̃ into H using a softmax layer. H is then transformed into the 2D coordinates K by taking the dot product with the X and Y coordinate matrices.
The j-th joint confidence is read out from H at the estimated 2D coordinate K_j. When the confidence is large enough (above a fixed threshold), the joint is considered detected, and the joint visibility v_j is set to 1; otherwise, it is set to 0.
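For illustration, the following numpy sketch mirrors this DSNT-style readout: the unnormalized heatmaps are softmax-normalized, the 2D coordinates are obtained as expectations over normalized coordinate grids, and a confidence is read out at the predicted location. Array shapes and the detection threshold are illustrative assumptions, not the values used in our network.

```python
# Sketch of the DSNT-style readout used at each stage.
import numpy as np


def dsnt_readout(h_tilde, conf_thresh=0.1):
    """h_tilde: (J, H, W) unnormalized heatmaps -> (keypoints, confidences, visibilities)."""
    J, H, W = h_tilde.shape
    flat = h_tilde.reshape(J, -1)
    flat = np.exp(flat - flat.max(axis=1, keepdims=True))   # numerically stable softmax
    heatmaps = (flat / flat.sum(axis=1, keepdims=True)).reshape(J, H, W)

    # Normalized coordinate grids in [-1, 1], as in the DSNT formulation.
    xs = (2 * np.arange(W) + 1) / W - 1
    ys = (2 * np.arange(H) + 1) / H - 1
    kx = (heatmaps.sum(axis=1) * xs).sum(axis=1)             # E[x] per joint
    ky = (heatmaps.sum(axis=2) * ys).sum(axis=1)             # E[y] per joint
    keypoints = np.stack([kx, ky], axis=1)                   # (J, 2) in [-1, 1]

    # Confidence: normalized heatmap value at the predicted pixel (threshold illustrative).
    px = np.clip(((kx + 1) * W / 2).astype(int), 0, W - 1)
    py = np.clip(((ky + 1) * H / 2).astype(int), 0, H - 1)
    conf = heatmaps[np.arange(J), py, px]
    visibility = (conf > conf_thresh).astype(int)
    return keypoints, conf, visibility
```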
4.2.2. Network Optimization
The joint detector network is trained to minimize a total loss composed of a regression term L_J for visible joints and an invisibility term L_invis, similar to the method in [3].
To estimate 2D coordinates for the visible joints (those with v_j^gt = 1), the regression loss is applied as
L_J = Σ_j v_j^gt [ ‖K_j − K_j^gt‖ + JS( H_j ‖ G(K_j^gt, σ) ) ],
where the superscript gt represents the ground-truth data for training, and v_j^gt indicates the binary visibility of each joint in the ground truth. G(K_j^gt, σ) is a 2D Gaussian map drawn at the coordinates K_j^gt with a 2D standard deviation σ; a fixed σ is used for training. JS denotes the Jensen–Shannon divergence, which measures the similarity between the confidence map H_j and the Gaussian map generated from the ground-truth data [59].
To discourage false detections, the invisibility loss L_invis is applied to the unnormalized heatmaps H̃: for every invisible joint (v_j^gt = 0), it penalizes the magnitude of H̃_j, suppressing it toward a zero heatmap. A zero heatmap corresponds to a uniform distribution in H_j after normalization, which drives the joint confidence down for invisible joints.
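A minimal sketch of such a training loss is given below, assuming a Euclidean regression term plus a Jensen–Shannon term for visible joints and a squared penalty on the unnormalized heatmaps of invisible joints; the exact penalty form and weighting used in our implementation may differ.

```python
# Illustrative detector loss: regression + JS shaping for visible joints,
# squared suppression of unnormalized heatmaps for invisible joints.
import numpy as np

EPS = 1e-12


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (flattened heatmaps)."""
    p, q = p.ravel() + EPS, q.ravel() + EPS
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def gaussian_map(center, shape, sigma=1.0):
    """Normalized 2D Gaussian drawn at `center` (x, y) in pixel coordinates."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    g = np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma ** 2))
    return g / (g.sum() + EPS)


def detector_loss(K, H, H_tilde, K_gt, vis_gt, sigma=1.0):
    """K: (J,2) predicted pixel coords, H: (J,h,w) normalized heatmaps,
    H_tilde: (J,h,w) unnormalized heatmaps, K_gt: (J,2), vis_gt: (J,) in {0,1}."""
    loss = 0.0
    for j in range(K.shape[0]):
        if vis_gt[j]:  # visible joint: coordinate regression + heatmap shaping
            loss += np.linalg.norm(K[j] - K_gt[j])
            loss += js_divergence(H[j], gaussian_map(K_gt[j], H[j].shape, sigma))
        else:          # invisible joint: suppress the unnormalized heatmap
            loss += np.mean(H_tilde[j] ** 2)
    return loss
```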
4.2.3. Training Details
The network is trained in multiple stages. First, it is pretrained on the COCO 2017 Human Pose dataset [45] to learn low-level texture features. Then, it is further trained on our custom dataset, as described in Section 4.1. Intermediate supervision is applied at each stage of the network to prevent vanishing gradients and ensure effective learning [36].
To improve detection robustness, several data augmentation techniques were employed. These include random image transformations such as scaling, rotation, translation, and horizontal flipping. Additionally, random color adjustments, including contrast and brightness changes, were introduced to account for varying lighting conditions. Based on empirical observations, random blurring and noise were also added to address the complexity of robot images, which often include intricate textures such as cables, electronic components, and body covers with bolts and nuts.
The four-stage stacked hourglass network has 13.0 M trainable parameters and is trained for 240 epochs using the RMSprop optimizer. On a machine with an Intel Xeon Silver 4310 processor, 128 GB of RAM, and dual NVIDIA RTX 4090 GPUs, training takes approximately 10.5 h.
4.3. Sensor Fusion for 3D Pose Estimation
The full-body pose estimation of the humanoid robot from the local user's perspective takes three inputs: the partial joint rotations (both arms) of the remote user acquired from body-worn inertial sensors, the headset pose from the XR glasses of the local user, and the stereo images from the camera rig mounted on the headset, as shown in Figure 2 and Figure 5. At runtime, these heterogeneous sensor measurements are utilized to localize and deform the digital human, and the resulting estimated pose provides the local user with an XR overlay on the humanoid robot. The full-body pose of the observed humanoid robot is estimated in six stages, as illustrated in Figure 7.
Figure 7. Full-body 3D pose estimation pipeline: Heterogeneous sensor data from the XR headset and stereo camera rig worn by the local user, combined with inertial sensors worn by the remote person, are fused to estimate the full-body pose of the digital human from the local user's perspective.
4.3.1. Stereo 2D Keypoints Inference
In the first stage, the stereo camera views, denoted as I^L and I^R, mounted on the XR headset are processed simultaneously by a single joint detection network. The keypoint detector estimates the humanoid robot's keypoints, K^L and K^R, from the left and right stereo images, respectively (see Figure 7).
The joint detection network with four stacked stages requires 48.66 giga floating-point operations (GFLOPs) per inference. The performance of the joint detector is summarized in Table 4. In our experiments, keypoint detection on the stereo images achieved an average processing speed of 45 FPS.
4.3.2. Stereo Camera Pose Estimation
The poses of the stereo cameras relative to the headset, T_HL and T_HR, are precalibrated as described in Section 3.4. In the second stage, the moving camera poses, T_L(t) and T_R(t), are estimated from the headset pose T_H(t) tracked by the built-in Visual SLAM, i.e., T_L(t) = T_H(t) · T_HL and T_R(t) = T_H(t) · T_HR at time t. We utilize the SLAM coordinate system as the global reference frame.
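The per-frame update is a simple composition of rigid transforms, as in the following sketch; the 4 × 4 homogeneous-matrix representation and variable names are assumptions chosen for illustration.

```python
# Sketch of the per-frame stereo camera pose update, assuming 4x4 homogeneous
# transforms: T_world_headset from the headset's Visual SLAM and the
# precalibrated headset-to-camera offsets from the checkerboard calibration.
import numpy as np


def stereo_camera_poses(T_world_headset, T_headset_camL, T_headset_camR):
    """Return world poses of the left/right cameras at the current frame."""
    T_world_camL = T_world_headset @ T_headset_camL
    T_world_camR = T_world_headset @ T_headset_camR
    return T_world_camL, T_world_camR


if __name__ == "__main__":
    T_hs = np.eye(4); T_hs[:3, 3] = [0.0, 1.6, 0.0]      # headset 1.6 m above origin
    T_L = np.eye(4);  T_L[:3, 3] = [-0.03, 0.0, 0.0]     # left camera 3 cm to the left
    T_R = np.eye(4);  T_R[:3, 3] = [+0.03, 0.0, 0.0]
    print(stereo_camera_poses(T_hs, T_L, T_R)[0][:3, 3])  # -> [-0.03  1.6  0.]
```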
4.3.3. Three-Dimensional Joints Estimation
In the third stage, the 3D joint positions in global space, P_t, are estimated using the 2D keypoints (K^L and K^R) and the stereo camera poses (T_L(t) and T_R(t)). To compute the 3D joint locations of the humanoid robot in world space, we determine, for each joint, the intersection of the ray through K^L from the left camera with the ray through K^R from the right camera. We apply the standard Direct Linear Transformation (DLT) algorithm to calculate these intersections, as detailed in [57].
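A minimal sketch of this triangulation step for a single joint is shown below, assuming 3 × 4 projection matrices for the two cameras; looping over the 14 joints yields the full set P_t.

```python
# Minimal DLT triangulation sketch for one joint observed in both stereo views.
# P_L and P_R are 3x4 projection matrices (K [R|t]) of the left/right cameras
# at the current frame; x_L, x_R are the detected 2D keypoints in pixels.
import numpy as np


def triangulate_dlt(P_L, P_R, x_L, x_R):
    """Return the 3D point (in world space) minimizing the algebraic DLT error."""
    A = np.stack([
        x_L[0] * P_L[2] - P_L[0],
        x_L[1] * P_L[2] - P_L[1],
        x_R[0] * P_R[2] - P_R[0],
        x_R[1] * P_R[2] - P_R[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # dehomogenize
```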
4.3.4. Global Pose Alignment
In the fourth stage, the pose of the digital human's root joint is estimated in global space using the 3D joint locations P_t. The root transformation T_root, comprising the global rotation R_root and the global translation t_root of the digital human, is estimated by minimizing the joint position errors between the T-posed initial joint locations P0 of the digital human and the current estimated 3D joint positions P_t:
T_root = argmin_T Σ_{j ∈ S} ‖ T · P0_j − P_t,j ‖²,
where S is the set of stationary torso joints, including the shoulders, hips, neck, and nose. T_root is estimated using the standard point-to-point ICP solver [64].
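Because the joint correspondences are known (the same joint index on both point sets), this alignment can also be solved in closed form with a single Kabsch step, as in the sketch below; it is shown as an illustrative alternative to the standard ICP solver used in our pipeline.

```python
# Sketch of the root alignment step: rigidly align the T-posed torso joints of
# the digital human to the triangulated torso joints with known correspondences.
import numpy as np


def align_root(P_tpose, P_obs):
    """P_tpose, P_obs: (N,3) corresponding torso joints. Returns (R, t) with
    P_obs ~= R @ P_tpose + t."""
    mu_s, mu_t = P_tpose.mean(axis=0), P_obs.mean(axis=0)
    H = (P_tpose - mu_s).T @ (P_obs - mu_t)                       # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_t - R @ mu_s
    return R, t
```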
4.3.5. Visual–Inertial Alignment for Bone Orientations
In the fifth stage, the limb poses are estimated by temporally aligning the 3D joint locations of the local humanoid robot with the inertial measurements received from the remote person.
Although the inertial sensors are initially calibrated, as described in Section 3.3, they tend to drift over time. This drift can result in misalignments between the poses estimated from the remote user's inertial measurements and those observed from the stereo images of the local humanoid robot.
Since the 3D joint estimation method using stereo images (as described in Section 4.3.3) operates on a per-frame basis, it can exhibit significant depth errors even with minimal pixel errors in the 2D keypoint locations. Figure 7 illustrates the depth errors of the estimated 3D joint locations caused by slight inaccuracies in the 2D keypoints. The joint locations (shown in red) should ideally reflect a straight pose, but instead display bent arms due to triangulation errors in the stereo 2D keypoints. This example demonstrates that such errors can be corrected by applying the corrected inertial measurements (shown in yellow) through visual–inertial alignment.
To reduce the misalignment between the inertial measurements and the visual observations, we employ the visual–inertial alignment method proposed in [3]. This method extracts the bone directions d_IMU from the inertial sensors and the bone directions d_vis from the estimated 3D joint locations P_t in world space whenever the corresponding joints are detected. To mitigate the misalignment, the IMU offset rotation R_off is estimated so that all inertial bone directions d_IMU align with the corresponding visual bone directions d_vis.
A bone direction, d_j, represents the directional vector from the location of the base (parent) joint to the tip (child) joint. It can be computed using either the inertial measurements or the 3D joint locations as
d_IMU,j = R_IMU,j · e_y,   d_vis,j = (P_t,j − P_t,par(j)) / ‖P_t,j − P_t,par(j)‖,
where R_IMU,j is the rotation matrix acquired from the calibrated body-worn inertial sensor, e_y = (0, 1, 0)^T selects the second column vector of the rotation matrix, and par(j) indicates the parent index of joint j. Note that d_IMU,j and d_vis,j are represented in the digital human's local coordinate space, obtained by canceling out the root transformation T_root.
The IMU offset matrix R_off can be estimated by solving the least-squares problem
R_off = argmin_R Σ_j ‖ R · d_IMU,j − d_vis,j ‖².
R_off is computed using the online k-means algorithm, with recent measurements given more weight than past ones to react to sudden sensor drift [3].
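This is an instance of the orthogonal Procrustes (Wahba) problem on unit vectors and admits a closed-form SVD solution, sketched below; the recency weighting used in our online update is only hinted at through the optional weights argument.

```python
# Sketch of the visual-inertial offset estimation: find one rotation R_off that
# maps the IMU-derived bone directions onto the visually triangulated ones.
import numpy as np


def estimate_imu_offset(d_imu, d_vis, weights=None):
    """d_imu, d_vis: (N,3) unit bone directions. Returns R_off with
    d_vis ~= R_off @ d_imu."""
    w = np.ones(len(d_imu)) if weights is None else np.asarray(weights)
    B = (w[:, None] * d_vis).T @ d_imu                 # 3x3 attitude profile matrix
    U, _, Vt = np.linalg.svd(B)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # proper rotation only
    return U @ D @ Vt
```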
4.3.6. Full-Body Motion Data
The full-body pose of the digital human, Θ_t, is represented as a concatenation of the global position t_root, the rotations of the stationary bones, and the rotations of the visually aligned moving limb bones, all in the world space of the local person. In the sixth stage, the full-body motion data Θ_t are updated for network transmission to the XR headset of the local user. The stationary bone rotations are computed by composing the global root rotation R_root with the corresponding initial T-pose bone rotations of the digital human model, and the limb bone rotations are computed by composing the drift-corrected inertial rotations R_off · R_IMU,j with those T-pose rotations.
4.3.7. XR Visualization
At runtime, the digital human is animated over time according to the estimated 3D pose. The posed digital human is visualized through the XR headset to provide digital human augmentation. Instead of wirelessly transmitting the entire surface of the digital human every frame, only the much smaller full-body pose data are sent between the pose estimation server and the XR headset. This approach ensures efficient network usage for the digital human's deformation over time.
At runtime, the estimated full-body pose of the digital human, Θ_t, is continuously sent to the XR headset of the local user. Within the XR headset, the digital human is deformed based on Θ_t and visualized at the global position t_root, achieving a real-time XR overlay of the digital human onto the humanoid robot. In experiments, the full-body pose estimation process, including CNN joint detection, achieves an average frame rate of 40 FPS.
4.3.8. Implementation Details
The 3D joint location estimation operates on a per-frame basis, as described in Section 4.3.3. In the current setup with humanoid robot V3, shown in Figure 3c, the robot's root transformation remains stationary in world space, and its lower-body pose is fixed. These relaxed constraints are applied to the 3D pose estimation solver, enabling the inclusion of temporal information.
When estimating the 3D joint positions P_t, the knees and ankles are additionally included in the stationary joint set S. Online-averaged 3D joint locations are computed for these fixed joints, assuming they stay stationary in world space. This constraint reduces motion jitter, allowing the root transformation T_root to converge rapidly, typically within a hundred frames. Future work aims to extend this method to walkable humanoid robots by applying temporal optimization techniques from [3].
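As a simple illustration of such online averaging, the sketch below uses an exponential moving average per stationary joint; the actual averaging scheme and its parameters in our prototype may differ.

```python
# Simple stand-in for the online averaging of stationary joints: an exponential
# moving average per joint, which damps per-frame triangulation jitter.
import numpy as np


class StationaryJointFilter:
    def __init__(self, alpha=0.05):
        self.alpha = alpha          # smaller alpha -> stronger smoothing
        self.mean = None            # (N_stationary, 3)

    def update(self, joints_3d):
        """joints_3d: (N_stationary, 3) current triangulated positions."""
        joints_3d = np.asarray(joints_3d, dtype=float)
        if self.mean is None:
            self.mean = joints_3d.copy()
        else:
            self.mean = (1 - self.alpha) * self.mean + self.alpha * joints_3d
        return self.mean
```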
The full-body pose estimation method is designed to handle occlusions and out-of-frame challenges. When limb joints are not visible in the stereo images, the corresponding 2D keypoints may be missing, causing failures in the 3D joint estimation detailed in Section 4.3.3. However, the visual–inertial alignment described in Section 4.3.5 still provides limb orientation estimates, enabling continued operation despite partial occlusions.
In contrast, the system encounters failure if torso joints, such as the shoulders or hips, are unobserved. The absence of these joints prevents estimation of the root transformation T_root, since the ICP solver in Section 4.3.4 then lacks sufficient correspondences.
5. Results and Evaluation
In this section, we present evaluations of our humanoid robot 3D pose estimation method for real-time 3D telepresence. Additionally, we showcase results for a potential application in remote physical training, demonstrating how our telepresence system provides immersive visual interactions and physical manipulations in remote work scenarios.
5.1. Ablation Study on 2D Joint Detectors for Humanoid Robots
In this section, we evaluate the performance of our CNN-based joint detector (described in Section 4.2) using the HRP training dataset. The detector was trained on this dataset and validated using the HRP validation dataset, as outlined in Table 2.
To ensure robust performance, the joint detector was trained using various dataset configurations: images from the V3 robot only, images from all robot versions (V1–V3), and images with random backgrounds. These configurations are detailed in Table 4.
For evaluation, we use the Percentage of Correct Keypoints (PCK) as our metric, which is consistent with previous human pose estimation work [58,62,65,66]. Under this metric, a predicted joint is considered correct if it lies within a threshold fraction of a reference length from the ground truth; for the MPII dataset [46], the reference length is commonly tied to the torso diameter. For a stricter evaluation on our HRP validation dataset, performance was assessed using PCK across multiple thresholds (0.1, 0.05, 0.025, 0.0125, 0.00625), and the mean PCK (mPCK) was calculated across these thresholds.
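For reference, the following sketch computes PCK and mPCK in this form; the reference length (e.g., the torso diameter per sample) is supplied by the caller.

```python
# Sketch of the PCK/mPCK evaluation: a predicted joint counts as correct when
# its distance to the ground truth is below a threshold times a reference length.
import numpy as np

THRESHOLDS = (0.1, 0.05, 0.025, 0.0125, 0.00625)


def pck(pred, gt, ref_len, thresh):
    """pred, gt: (N_samples, J, 2) 2D joints; ref_len: (N_samples,) reference lengths."""
    dist = np.linalg.norm(pred - gt, axis=-1)               # (N, J)
    return float((dist < thresh * ref_len[:, None]).mean())


def mean_pck(pred, gt, ref_len, thresholds=THRESHOLDS):
    return float(np.mean([pck(pred, gt, ref_len, t) for t in thresholds]))
```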
The results of 2D pose estimation using different dataset configurations are summarized in Table 4. The evaluation shows that using images from all robot versions (V1–V3) provided slightly better performance compared with the other configurations, while adding random background images did not result in a performance improvement. Nonetheless, all configurations achieved over 97% accuracy under the PCK metric, corresponding to an average pixel error of 6.22 pixels.
For the remainder of the experiments, the joint detector trained on Dataset C (Robot V1 + V2 + V3), as shown in Table 4, is selected as the optimal detector.
Figure 8 shows the 2D pose estimation results from various scenarios using head-worn cameras. The results demonstrate that our joint detector effectively detects full-body joints, even during fast head motions, under significant motion blur, or when parts of the body are out of frame or at close distances.
5.2. Performance Evaluation of 3D Pose Estimation for Humanoid Robots
Our learning-based full-body 3D pose estimation method for humanoid robots is not directly comparable to any previous approaches we are aware of. Prior robot pose estimation methods [12,13] only work with specific types of robots with markers and do not function with generic full-body humanoid robots without markers. Learning-based humanoid robot pose estimation methods [14,49] are unable to estimate 2D keypoints for full-body joints. Prior full-body human pose estimation methods [3,67,68,69] do not work with humanoid robot imagery.
Our 3D pose estimation method, denoted as Ours, uses head-worn stereo views, head tracking, and inertial measurements from the upper arms and wrists to estimate the full-body pose for the XR overlay. We evaluate our method on the Humanoid Robot Pose (HRP) dataset (see Section 4.1) and compare it against the baseline approaches detailed in Section 5.2.1.
The evaluation is conducted in a lab environment under the same lighting conditions as the training set, with pose estimation performance assessed in world space (see Section 5.2.2) and in the head-worn camera space (see Section 5.2.3).
5.2.1. Baseline Approaches
The Motion Capture system, referred to as Mocap, serves as a baseline approach that assumes all relevant information is consistently provided over time. The humanoid robot's arm poses are given by motion capture data from the calibrated inertial sensors, as described in Section 3.3. The lower-body pose, root rotation, and root position are also provided from ground-truth (GT) data at the first frame. However, this strong assumption limits the method's applicability to specific robots and locations. In contrast, Ours requires no prior knowledge and can be applied to any type of robot or environment.
Smplify [40] is a single-image-based method that estimates the full-body pose using 2D keypoints and the SMPL body model [41]. For a fair comparison, we provide Smplify with the keypoints detected by our joint detector (see Section 5.1). However, Smplify's single-image approach lacks depth information, resulting in significant inaccuracies in the 3D body root position estimates and a lack of multiview consistency. To mitigate this, we replace Smplify's raw body root positions with ground-truth (GT) data (see Section 4.1). We denote the variant using the left stereo image with GT root positions as Smplify-L and the variant using the right stereo image with GT root positions as Smplify-R.
Our 3D pose estimation runs at an average of 40 FPS on a desktop PC equipped with an AMD Ryzen 9 5900X processor (3.70 GHz), 64 GB of RAM, and an NVIDIA GeForce RTX 3080 Ti GPU. In comparison, Smplify takes approximately 45 s to process a single view.
5.2.2. Evaluation in World Space
We conducted evaluations using the HRP dataset's evaluation set, as described in Table 3, which contains 3D humanoid robot motions under different headset movements. Sequence 1 features minimal camera movements, while Sequence 2 involves rapid camera movements. The estimated full-body 3D pose of each method is retargeted to the SMPL body model [41], and the results are overlaid from another viewpoint to demonstrate accuracy in 3D. The shape parameters of SMPL are estimated from the detected 3D joint positions using the method described in [3].
Figure 9 presents qualitative comparisons of full-body pose estimation in world space between Mocap, Smplify, and Ours. Each frame selected from Sequence 2 shows an overlaid image from a fixed world viewpoint for each method. For detailed comparative results over time, please refer to the Supplemental Video.
To assess the accuracy of our 3D pose estimation, we compared its performance with the baseline approaches using ground truth data, measuring 3D joint position and orientation errors in world space. The overall results for body part evaluations are presented in Table 5 and Table 6, with detailed average joint orientation and position errors for each bone in Sequence 1, Sequence 2, and overall.
In all sequences shown in Table 5, our method consistently achieves significantly lower joint orientation errors compared with the baseline approaches. While the Mocap method uses an initial calibration between the inertial sensors and the remote user in a straight pose, its arm orientation errors are notably higher than those of the lower body. This discrepancy is primarily due to the lack of real-time adjustments in Mocap, which leads to misalignments over time as the remote user and the local humanoid robot perform similar motions.
In contrast, the arm orientation errors of Ours are significantly reduced, thanks to the continuous calibration updates provided by the visual–inertial alignment process described in Section 4.3. This approach enables our method to outperform Mocap, especially in terms of upper-body joint orientation accuracy. Additionally, our method estimates the lower-body pose with an accuracy comparable to Mocap, despite not relying on external knowledge or ground-truth (GT) data for the root orientation and position. This achievement highlights the robustness of our system, as illustrated in Figure 9.
Smplify-L and Smplify-R exhibit substantially higher errors compared with Ours, primarily due to inaccurate root joint orientation estimation, especially when the robot is not facing forward or is off-center in the images. Since Smplify is mainly trained on human bodies for its shape and pose priors, these inaccuracies likely arise from significant differences between the joint placements of the humanoid robot and those of humans. Additionally, Smplify displays considerable motion jitter over time, whereas the temporal stability of Ours can be attributed to our 3D joint estimation method, which leverages temporal information, as described in Section 4.3.8.
In the evaluation of per-joint 3D position errors presented in Table 6, our method, which utilizes visual–inertial alignment, demonstrates significantly higher accuracy for the arm joints (elbows and wrists) compared to the other baseline approaches. However, for the torso and lower-body joints (neck, shoulders, hips, knees, and ankles), our method shows slightly lower accuracy than the baseline Motion Capture (Mocap) system. The accuracy of these joints is highly dependent on the precision of the root transformation estimated in Section 4.3.4. It is important to note that the root transformation of the Mocap system is derived from ground-truth (GT) data, whereas our method estimates the root without relying on any external information.
Overall, the results in world space show that the full-body pose estimation by our method, which utilizes our own joint detector, maintains multiview-consistent root joint estimations without using any prior knowledge, whereas Mocap and Smplify require substantial ground-truth data for the root joint.
5.2.3. Evaluation on Head-Worn Views
To evaluate the performance of the digital human overlay, we used the evaluation set of the HRP dataset, which includes 2D humanoid robot poses captured from moving stereo camera images across two test sequences. The estimated full-body 3D pose from each method is projected onto the head-worn camera’s viewpoint, with the results overlaid to assess the accuracy of the digital human overlay.
Figure 10 presents qualitative comparisons of full-body pose estimation in head-worn camera space between Mocap, Smplify, and Ours. Each frame corresponds to the frames in Figure 9, displaying the overlaid left image from the stereo rig. For detailed comparative results over time, please refer to the Supplemental Video.
To assess the accuracy of our 3D pose estimation for the digital human overlay, we compared its performance against the baseline approaches using ground truth data, measuring the Percentage of Correct Keypoints (PCK) in head-worn camera space. The overall results for body part evaluations are presented in Table 7, detailing the mean PCK for each joint in Sequence 1, Sequence 2, and overall. The mPCK was computed across eight PCK thresholds (0.03, 0.04, …, 0.1).
In the per-joint 2D position error evaluation in Table 7, Ours, utilizing visual–inertial alignment, demonstrates significantly higher accuracy for the arm joints (elbows and wrists) compared with the other baseline approaches. The 2D positions of the torso and lower body (neck, shoulders, hips, knees, and ankles) estimated by Ours show accuracy comparable to Mocap.
The results in the head-worn view space indicate that our method consistently outperforms the baseline approaches, delivering the best digital human overlay results without relying on any prior knowledge, while both Mocap and Smplify require substantial ground-truth data for an accurate digital human overlay.
5.3. Application
In this section, we demonstrate a remote physical training (PT) scenario in Extended Reality (XR) to showcase the capability of our 3D telepresence system for real-time remote visual and physical interactions. The detailed system setup is described in Section 3.3 and Section 3.4.
The remote trainer guides the local user through physical training motions interactively, providing immediate audio feedback to enhance performance. Our system enables the remote trainer to control the humanoid robot using only body-worn sensors, eliminating the need for complex manipulation via external controllers. With wireless body-worn sensors and an XR headset, motion capture becomes location-independent, allowing free movement in any environment.
Additionally, our prototype robot is capable of grasping and manipulating objects using the SoftHand, as described in Section 3.1. The remote user wears additional IMUs on both hands to control the robot's grasping motions. When the user's fingers grasp, the relative rotations between the wrist and hand sensors are detected (see Section 3.3).
Sample results of the remote robot's operations, including grasping motions, are shown in Figure 1 (bottom row, 1 × 3 image group). Further demonstrations of remote physical interactions, such as object handovers between the robot and the local user, can be viewed in the Supplemental Video.
The local user interacts visually with the remote trainer through the overlaid digital human on the robot, enhancing the sense of realism and immersion. Figure 11 presents selected frames of the digital human overlay as viewed through the head-mounted display worn by the local user. These results, captured during live demonstrations under lighting conditions different from those in the training data, highlight the system's ability to function effectively in various environments.
5.4. Network Efficiency
Instead of transmitting a posed high-quality surface of the digital human in every frame, our pose estimation system wirelessly sends a much smaller-sized pose data packet to the XR headset for visual augmentation onto the humanoid robot, as illustrated in
Figure 2 and
Figure 7. The size of the network packet is a key factor in overall system latency; keeping packets small ensures efficient network usage for pose data transmission.
For network transmission, the full-body motion data (as described in
Section 4.3.6) are encoded into a fixed-sized packet, which includes the pose of the upper body, a binary grab indicator for both hands (as described in
Section 3.3), and the root transformation. The pose of the lower body remains fixed in the current prototype due to relaxed constraints, as outlined in
Section 4.3.8, and is therefore excluded from the packet.
In summary, the packet comprises the upper body pose (four bones for each arm, represented as quaternions), the binary grab indicator for both hands (four labels stored in one byte), and the root transformation (a rotation quaternion and the 3D position). This results in a total packet size of 24 bytes.
The use of small, fixed-length packets enhances network efficiency. In our current prototype system, each packet is transmitted from the pose server to the headset using TCP socket networking.
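As an illustration of this pack-and-send pattern (it does not reproduce the prototype's exact 24-byte field layout), a fixed-length packet can be encoded and transmitted roughly as follows; the struct format, field widths, and server address are assumptions.

```python
import struct

# Illustrative fixed-length layout only; the prototype's actual 24-byte
# encoding is not specified here. This packet carries the root rotation
# (quaternion, 4 x float32), the root position (3 x float32), and one byte
# of per-hand grab flags, omitting the arm-bone quaternions for brevity.
PACKET_FMT = "<4f3fB"
PACKET_SIZE = struct.calcsize(PACKET_FMT)          # 29 bytes with this layout

def encode_packet(root_quat, root_pos, grab_flags):
    return struct.pack(PACKET_FMT, *root_quat, *root_pos, grab_flags & 0xFF)

packet = encode_packet((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 1.2), 0b01)
assert len(packet) == PACKET_SIZE

# Because every packet has the same length, the headset can read exactly
# PACKET_SIZE bytes per frame from the TCP stream (address is a placeholder):
# import socket
# with socket.create_connection(("pose-server.local", 5005)) as sock:
#     sock.sendall(packet)
```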
5.5. System Latency
Our prototype system utilizes various sensor data for real-time telepresence. The inertial sensor-based motion capture for the remote user is updated at 60 Hz, and the captured motion is transmitted wirelessly to the humanoid robot, which updates its pose at 120 Hz. Fast motions exhibit a delay of approximately 0.5 s in the robot’s teleoperation.
The remote operator uses video conferencing software [
56] to share the robot’s view and communicate with the local user. This also introduces a delay of about 0.5 s, which is the main source of latency in our system. In future versions, we plan to develop a dedicated video conferencing system to reduce this latency.
As the current XR headset [
53] cannot handle deep learning inference, we use an external processing machine in our setup. The pose estimation server receives motion capture data (60 Hz, wireless), stereo images (60 Hz, wired), and headset pose data (60 Hz, wireless) as input. The unsynchronized sensor data are processed whenever all three inputs are received by the server, with a frame offset of up to 16 ms in the current setup.
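A minimal sketch of this gating logic is shown below, assuming each sensor stream delivers its samples to a thread-safe queue; the queue names and the processing callback are hypothetical.

```python
import queue

# Hypothetical per-stream queues; the actual server receives IMU motion capture
# (60 Hz, wireless), stereo images (60 Hz, wired), and headset poses (60 Hz,
# wireless) over separate connections.
streams = {"mocap": queue.Queue(), "stereo": queue.Queue(), "headset": queue.Queue()}

def processing_loop(estimate_and_send_pose):
    """Run pose estimation whenever one sample from each input is available.
    No hard timestamp synchronization is enforced, which matches the tolerated
    inter-stream offset of up to ~16 ms."""
    pending = {}
    while True:
        for name, q in streams.items():
            if name not in pending:
                try:
                    pending[name] = q.get(timeout=0.005)   # wait briefly for a sample
                except queue.Empty:
                    pass                                   # stream not ready yet
        if len(pending) == len(streams):                   # all three inputs present
            estimate_and_send_pose(pending.pop("mocap"),
                                   pending.pop("stereo"),
                                   pending.pop("headset"))
```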
Our pose estimation system runs at 40 FPS (
Section 4.3) and continuously transmits the updated pose of the digital human to the XR headset. This external processing setup introduces a few frames of delay. In the next iteration, we plan to develop an on-device deep learning pose estimator, which we expect will significantly reduce wireless network delay.
6. Limitation and Future Work
The primary limitation of our approach lies in pose estimation inaccuracy, particularly for arm movements. The ray-triangulation-based 3D joint estimation described in Section 4.3.4 relies on accurate 2D keypoints from the stereo images; even small 2D errors can therefore produce inaccurate 3D joint estimates. To address this, we plan to incorporate temporally consistent joint optimization as proposed in [
3].
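To make this sensitivity concrete, the following sketch triangulates a joint as the midpoint between two back-projected stereo rays and perturbs one 2D keypoint by two pixels; the intrinsics and the 12 cm baseline are placeholder values, not our stereo rig's calibration.

```python
import numpy as np

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])                       # placeholder intrinsics
c_left = np.zeros(3)
c_right = np.array([0.12, 0.0, 0.0])                  # assumed 12 cm stereo baseline

def back_project(pixel, center):
    """Unit ray through a pixel for a camera at `center` looking along +z."""
    d = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    return center, d / np.linalg.norm(d)

def triangulate_midpoint(px_left, px_right):
    """3D point as the midpoint of the closest points on the two rays."""
    c1, d1 = back_project(px_left, c_left)
    c2, d2 = back_project(px_right, c_right)
    A = np.column_stack([d1, -d2])                    # solve c1 + t1*d1 ~= c2 + t2*d2
    t, *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    return 0.5 * ((c1 + t[0] * d1) + (c2 + t[1] * d2))

# At ~2 m depth, a 2-pixel error on one keypoint shifts the joint by ~10 cm.
p_clean = triangulate_midpoint((320.0, 240.0), (284.0, 240.0))
p_noisy = triangulate_midpoint((320.0, 240.0), (286.0, 240.0))
print(p_clean, np.linalg.norm(p_noisy - p_clean))
```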
Additionally, wireless delays in sensor measurements from VSLAM and IMUs further contribute to pose estimation inaccuracies, which can lead to misaligned XR overlay visualizations from the user’s perspective. In future versions, we aim to improve sensor synchronization to reduce system latency.
Our current system’s limitations highlight several areas for future research. One promising direction is developing an on-device pose estimator. The current XR headset lacks the capability to handle deep learning inference, necessitating an external machine and introducing network-related latencies. An on-device solution would enable the entire system, including deep learning inference, to run directly on the XR device, potentially leading to significant performance improvements.
Another area of interest is extending the approach to walking robots, which could enhance 3D telepresence and support remote work in outdoor environments. In future iterations, we plan to improve the mobility of the humanoid robot to enable remote working in a broader range of scenarios.
The current pose estimation method is designed to detect a single robot. Expanding this approach to support multiple remote working robots is another interesting direction for future research.
7. Conclusions
We introduced a real-time 3D telepresence system utilizing a humanoid robot as a step toward XR-based remote working. Our system facilitates physical interaction through the humanoid robot while providing visual immersion via an overlaid digital human avatar. To address the challenge of consistently synchronizing avatars, we proposed a markerless 3D pose estimation method specifically designed for humanoid robots, leveraging our newly collected dataset.
Our results demonstrate the robustness and consistency of this method in estimating full-body poses for humanoid robots using head-worn cameras, without relying on external knowledge such as the robot’s global position or full-body pose. We showcased the system’s potential through an application in remote physical training, highlighting the effectiveness of simultaneous visual and physical interactions using an XR headset.
We envision our prototype evolving into a fully mobile system, featuring an XR headset integrated with an on-device deep learning pose estimator. This development would eliminate the need for an external machine and significantly reduce system latency. Future iterations will aim to support both mobile and multiple humanoid robots, enhancing the utility and productivity of our telepresence prototype across various remote working tasks.
Author Contributions
Conceptualization, Y.L., H.L. and Y.C. (Youngwoon Cha); Methodology, Y.C. (Youngdae Cho) and Y.C. (Youngwoon Cha); Software, Y.C. (Youngdae Cho), W.S., J.B. and Y.C. (Youngwoon Cha); Validation, Y.C. (Youngdae Cho), H.L. and Y.C. (Youngwoon Cha); Formal analysis, Y.C. (Youngdae Cho) and Y.C. (Youngwoon Cha); Investigation, Y.C. (Youngdae Cho), W.S., J.B. and Y.C. (Youngwoon Cha); Resources, Y.L., H.L. and Y.C. (Youngwoon Cha); Data curation, Y.C. (Youngdae Cho), W.S. and J.B.; Writing—original draft, Y.C. (Youngdae Cho), J.B. and Y.C. (Youngwoon Cha); Writing—review & editing, Y.C. (Youngwoon Cha); Visualization, Y.C. (Youngdae Cho); Supervision, Y.L., H.L. and Y.C. (Youngwoon Cha); Project administration, Y.C. (Youngwoon Cha); Funding acquisition, Y.C. (Youngwoon Cha). All authors have read and agreed to the published version of the manuscript.
Funding
This work was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00214511), by Korea Institute of Science and Technology (KIST) Institutional Program (2E33003 and 2E33000), and by Institute of Information & communications Technology Planning & Evaluation (IITP) under the metaverse support program to nurture the best talents (IITP-2024-RS-2023-00256615) grant funded by the Korea government (MSIT).
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Orts-Escolano, S.; Rhemann, C.; Fanello, S.; Chang, W.; Kowdle, A.; Degtyarev, Y.; Kim, D.; Davidson, P.L.; Khamis, S.; Dou, M.; et al. Holoportation: Virtual 3d teleportation in real-time. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, Tokyo, Japan, 16–19 October 2016; pp. 741–754. [Google Scholar]
- Cha, Y.W.; Price, T.; Wei, Z.; Lu, X.; Rewkowski, N.; Chabra, R.; Qin, Z.; Kim, H.; Su, Z.; Liu, Y.; et al. Towards fully mobile 3D face, body, and environment capture using only head-worn cameras. IEEE Trans. Vis. Comput. Graph. 2018, 24, 2993–3004. [Google Scholar] [CrossRef]
- Cha, Y.W.; Shaik, H.; Zhang, Q.; Feng, F.; State, A.; Ilie, A.; Fuchs, H. Mobile. Egocentric Human Body Motion Reconstruction Using Only Eyeglasses-mounted Cameras and a Few Body-worn Inertial Sensors. In Proceedings of the 2021 IEEE Virtual Reality and 3D User Interfaces (VR), Lisboa, Portugal, 27 March–1 April 2021; pp. 616–625. [Google Scholar]
- Kristoffersson, A.; Coradeschi, S.; Loutfi, A. A review of mobile robotic telepresence. Adv. Hum.-Comput. Interact. 2013, 2013, 902316. [Google Scholar] [CrossRef]
- Zhang, G.; Hansen, J.P. Telepresence robots for people with special needs: A systematic review. Int. J. Hum.-Comput. Interact. 2022, 38, 1651–1667. [Google Scholar] [CrossRef]
- Aymerich-Franch, L.; Petit, D.; Ganesh, G.; Kheddar, A. Object touch by a humanoid robot avatar induces haptic sensation in the real hand. J. Comput.-Mediat. Commun. 2017, 22, 215–230. [Google Scholar] [CrossRef]
- Bremner, P.; Celiktutan, O.; Gunes, H. Personality perception of robot avatar tele-operators. In Proceedings of the 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Christchurch, New Zealand, 7–10 March 2016; pp. 141–148. [Google Scholar]
- Luo, R.; Wang, C.; Schwarm, E.; Keil, C.; Mendoza, E.; Kaveti, P.; Alt, S.; Singh, H.; Padir, T.; Whitney, J.P. Towards robot avatars: Systems and methods for teleinteraction at avatar xprize semi-finals. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 7726–7733. [Google Scholar]
- Khatib, O.; Yeh, X.; Brantner, G.; Soe, B.; Kim, B.; Ganguly, S.; Stuart, H.; Wang, S.; Cutkosky, M.; Edsinger, A.; et al. Ocean one: A robotic avatar for oceanic discovery. IEEE Robot. Autom. Mag. 2016, 23, 20–29. [Google Scholar] [CrossRef]
- Hauser, K.; Watson, E.N.; Bae, J.; Bankston, J.; Behnke, S.; Borgia, B.; Catalano, M.G.; Dafarra, S.; van Erp, J.B.; Ferris, T.; et al. Analysis and perspectives on the ana avatar xprize competition. Int. J. Soc. Robot. 2024, 1–32. [Google Scholar] [CrossRef]
- Double. Available online: https://www.doublerobotics.com/ (accessed on 25 September 2024).
- Tejwani, R.; Ma, C.; Bonato, P.; Asada, H.H. An Avatar Robot Overlaid with the 3D Human Model of a Remote Operator. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 7061–7068. [Google Scholar] [CrossRef]
- Jones, B.; Zhang, Y.; Wong, P.N.; Rintel, S. Belonging there: VROOM-ing into the uncanny valley of XR telepresence. Proc. ACM Hum.-Comput. Interact. 2021, 5, 1–31. [Google Scholar] [CrossRef]
- Amini, A.; Farazi, H.; Behnke, S. Real-time pose estimation from images for multiple humanoid robots. In RoboCup 2021: Robot World Cup XXIV; Alami, R., Biswas, J., Cakmak, M., Obst, O., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 91–102. [Google Scholar]
- Latoschik, M.E.; Roth, D.; Gall, D.; Achenbach, J.; Waltemate, T.; Botsch, M. The effect of avatar realism in immersive social virtual realities. In Proceedings of the 23rd ACM Symposium on Virtual Reality Software and Technology, Gothenburg, Sweden, 8–10 November 2017; pp. 1–10. [Google Scholar]
- Choi, Y.; Lee, J.; Lee, S.H. Effects of locomotion style and body visibility of a telepresence avatar. In Proceedings of the 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Atlanta, GA, USA, 22–26 March 2020; pp. 1–9. [Google Scholar]
- Aseeri, S.; Interrante, V. The Influence of Avatar Representation on Interpersonal Communication in Virtual Social Environments. IEEE Trans. Vis. Comput. Graph. 2021, 27, 2608–2617. [Google Scholar] [CrossRef]
- Fribourg, R.; Argelaguet, F.; Lécuyer, A.; Hoyet, L. Avatar and sense of embodiment: Studying the relative preference between appearance, control and point of view. IEEE Trans. Vis. Comput. Graph. 2020, 26, 2062–2072. [Google Scholar] [CrossRef]
- Liao, T.; Zhang, X.; Xiu, Y.; Yi, H.; Liu, X.; Qi, G.J.; Zhang, Y.; Wang, X.; Zhu, X.; Lei, Z. High-Fidelity Clothed Avatar Reconstruction From a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 8662–8672. [Google Scholar]
- Zhao, X.; Wang, L.; Sun, J.; Zhang, H.; Suo, J.; Liu, Y. HAvatar: High-fidelity Head Avatar via Facial Model Conditioned Neural Radiance Field. ACM Trans. Graph. 2023, 43, 1–16. [Google Scholar] [CrossRef]
- Thies, J.; Zollhöfer, M.; Nießner, M.; Valgaerts, L.; Stamminger, M.; Theobalt, C. Real-time expression transfer for facial reenactment. ACM Trans. Graph. 2015, 34, 183-1. [Google Scholar] [CrossRef]
- Shen, K.; Guo, C.; Kaufmann, M.; Zarate, J.J.; Valentin, J.; Song, J.; Hilliges, O. X-Avatar: Expressive Human Avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 16911–16921. [Google Scholar]
- Gafni, G.; Thies, J.; Zollhofer, M.; Niessner, M. Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 8649–8658. [Google Scholar]
- Yu, K.; Gorbachev, G.; Eck, U.; Pankratz, F.; Navab, N.; Roth, D. Avatars for Teleconsultation: Effects of Avatar Embodiment Techniques on User Perception in 3D Asymmetric Telepresence. IEEE Trans. Vis. Comput. Graph. 2021, 27, 4129–4139. [Google Scholar] [CrossRef] [PubMed]
- Panda, P.; Nicholas, M.J.; Gonzalez-Franco, M.; Inkpen, K.; Ofek, E.; Cutler, R.; Hinckley, K.; Lanier, J. Alltogether: Effect of avatars in mixed-modality conferencing environments. In Proceedings of the 1st Annual Meeting of the Symposium on Human-Computer Interaction for Work, Durham, NH, USA, 8–9 June 2022; pp. 1–10. [Google Scholar]
- Qiu, H.; Streli, P.; Luong, T.; Gebhardt, C.; Holz, C. ViGather: Inclusive Virtual Conferencing with a Joint Experience Across Traditional Screen Devices and Mixed Reality Headsets. Proc. ACM Hum.-Comput. Interact. 2023, 7, 1–27. [Google Scholar] [CrossRef]
- Tachi, S.; Kawakami, N.; Nii, H.; Watanabe, K.; Minamizawa, K. Telesarphone: Mutual telexistence master-slave communication system based on retroreflective projection technology. SICE J. Control Meas. Syst. Integr. 2008, 1, 335–344. [Google Scholar] [CrossRef]
- Fernando, C.L.; Furukawa, M.; Kurogi, T.; Kamuro, S.; Minamizawa, K.; Tachi, S. Design of TELESAR V for transferring bodily consciousness in telexistence. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 5112–5118. [Google Scholar]
- Steed, A.; Steptoe, W.; Oyekoya, W.; Pece, F.; Weyrich, T.; Kautz, J.; Friedman, D.; Peer, A.; Solazzi, M.; Tecchia, F.; et al. Beaming: An asymmetric telepresence system. IEEE Comput. Graph. Appl. 2012, 32, 10–17. [Google Scholar] [CrossRef] [PubMed]
- Hilty, D.M.; Randhawa, K.; Maheu, M.M.; McKean, A.J.; Pantera, R.; Mishkind, M.C.; Rizzo, A.S. A review of telepresence, virtual reality, and augmented reality applied to clinical care. J. Technol. Behav. Sci. 2020, 5, 178–205. [Google Scholar] [CrossRef]
- Tsui, K.M.; Desai, M.; Yanco, H.A.; Uhlik, C. Exploring use cases for telepresence robots. In HRI ’11, Proceedings of the 6th International Conference on Human-Robot Interaction, Lausanne, Switzerland, 6–9 March 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 11–18. [Google Scholar] [CrossRef]
- Schwarz, M.; Lenz, C.; Rochow, A.; Schreiber, M.; Behnke, S. NimbRo Avatar: Interactive Immersive Telepresence with Force-Feedback Telemanipulation. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 5312–5319. [Google Scholar] [CrossRef]
- Lenz, C.; Behnke, S. Bimanual telemanipulation with force and haptic feedback through an anthropomorphic avatar system. Robot. Auton. Syst. 2023, 161, 104338. [Google Scholar] [CrossRef]
- Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part VI 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 34–50. [Google Scholar]
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Pavlakos, G.; Zhu, L.; Zhou, X.; Daniilidis, K. Learning to Estimate 3D Human Pose and Shape From a Single Color Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.A.; Tzionas, D.; Black, M.J. Expressive Body Capture: 3D Hands, Face, and Body From a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Kocabas, M.; Athanasiou, N.; Black, M.J. VIBE: Video Inference for Human Body Pose and Shape Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Bogo, F.; Kanazawa, A.; Lassner, C.; Gehler, P.; Romero, J.; Black, M.J. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part V 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 561–578. [Google Scholar]
- Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graph. (TOG) 2015, 34, 1–16. [Google Scholar] [CrossRef]
- von Marcard, T.; Rosenhahn, B.; Black, M.; Pons-Moll, G. Sparse Inertial Poser: Automatic 3D Human Pose Estimation from Sparse IMUs. Comput. Graph. Forum 2017, 36, 349–360. [Google Scholar] [CrossRef]
- Huang, Y.; Kaufmann, M.; Aksan, E.; Black, M.J.; Hilliges, O.; Pons-Moll, G. Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time. ACM Trans. Graph. (TOG) 2018, 37, 1–15. [Google Scholar]
- von Marcard, T.; Henschel, R.; Black, M.; Rosenhahn, B.; Pons-Moll, G. Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Computer Vision—ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
- Miseikis, J.; Knobelreiter, P.; Brijacak, I.; Yahyanejad, S.; Glette, K.; Elle, O.J.; Torresen, J. Robot localisation and 3D position estimation using a free-moving camera and cascaded convolutional neural networks. In Proceedings of the 2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Auckland, New Zealand, 9–12 July 2018; pp. 181–187. [Google Scholar]
- Lee, T.E.; Tremblay, J.; To, T.; Cheng, J.; Mosier, T.; Kroemer, O.; Fox, D.; Birchfield, S. Camera-to-robot pose estimation from a single image. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 9426–9432. [Google Scholar]
- Lu, J.; Richter, F.; Yip, M.C. Pose estimation for robot manipulators via keypoint optimization and sim-to-real transfer. IEEE Robot. Autom. Lett. 2022, 7, 4622–4629. [Google Scholar] [CrossRef]
- qb SoftHand Research. Available online: https://qbrobotics.com/product/qb-softhand-research/ (accessed on 25 September 2024).
- Blender. Available online: https://www.blender.org/ (accessed on 25 September 2024).
- Apple Vision Pro. Available online: https://www.apple.com/apple-vision-pro/ (accessed on 25 September 2024).
- XReal Light. Available online: https://www.xreal.com/light/ (accessed on 25 September 2024).
- Xsens MTw Awinda. Available online: https://www.movella.com/products/wearables/xsens-mtw-awinda/ (accessed on 25 September 2024).
- GoPro. Available online: https://gopro.com/ (accessed on 25 September 2024).
- Zoom. Available online: https://zoom.us/ (accessed on 25 September 2024).
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part VIII 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 483–499. [Google Scholar]
- Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. Numerical coordinate regression with convolutional neural networks. arXiv 2018, arXiv:1801.07372. [Google Scholar]
- V7 Darwin. Available online: https://darwin.v7labs.com/ (accessed on 25 September 2024).
- Sumikura, S.; Shibuya, M.; Sakurada, K. OpenVSLAM: A versatile visual SLAM framework. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2292–2295. [Google Scholar]
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. Vitpose: Simple vision transformer baselines for human pose estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584. [Google Scholar]
- Lovanshi, M.; Tiwari, V. Human pose estimation: Benchmarking deep learning-based methods. In Proceedings of the 2022 IEEE Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI), Gwalior, India, 21–23 December 2022; pp. 1–6. [Google Scholar]
- Besl, P.J.; McKay, N.D. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures; Schenker, P.S., Ed.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 1992; Volume 1611, pp. 586–606. [Google Scholar] [CrossRef]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
- Artacho, B.; Savakis, A. Omnipose: A multi-scale framework for multi-person pose estimation. arXiv 2021, arXiv:2103.10180. [Google Scholar]
- Shimada, S.; Golyanik, V.; Xu, W.; Theobalt, C. Physcap: Physically plausible monocular 3d motion capture in real time. ACM Trans. Graph. (ToG) 2020, 39, 1–16. [Google Scholar] [CrossRef]
- Yi, X.; Zhou, Y.; Habermann, M.; Golyanik, V.; Pan, S.; Theobalt, C.; Xu, F. EgoLocate: Real-time motion capture, localization, and mapping with sparse body-mounted sensors. ACM Trans. Graph. (TOG) 2023, 42, 1–17. [Google Scholar] [CrossRef]
- Winkler, A.; Won, J.; Ye, Y. Questsim: Human motion tracking from sparse sensors with simulated avatars. In Proceedings of the SIGGRAPH Asia 2022 Conference Papers, Daegu, Republic of Korea, 6–9 December 2022; pp. 1–8. [Google Scholar]
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).