1. Introduction
A digital presence in remote locations facilitates social interaction between geographically separated sites. Telepresence systems eliminate the need for physical travel by utilizing digital avatars as replacements for physical presence. We anticipate that such telepresence systems will gain significant traction and become increasingly commonplace within daily work environments. Furthermore, we expect 3D telepresence systems to reduce travel costs and greatly enhance productivity through remote working in everyday life.
Existing immersive 3D telepresence systems [1,2,3], shown in Table 1a, demonstrate promising results for real-time remote presence by visualizing a digital human avatar of a remote person in 3D, enabling full-body visual interactions between remote individuals. However, despite offering visual immersion of the remote person, such systems lack the capability to facilitate physical interactions with them. This limitation includes the inability to physically manipulate remote objects, such as touching and holding them, thereby restricting the potential benefits of remote working through telepresence.
Mobile Robotic Telepresence (MRP) systems [4,5], shown in Table 1b, offer users mobility in remote sites by enabling remote control of telepresence robots. Users navigate through remote locations using shared 2D camera views from the robot. However, existing commercial telepresence robots, which lack human-like features such as arms, are unable to physically interact with objects in remote sites. This limitation restricts the level of interaction possible between remote individuals.
Prior robot teleoperation systems [6,7,8,9,10], shown in Table 1c, support users in making remote physical contact through the manipulation of humanoid robots. These robots share 2D camera views from remote sites and perform physical actions in the environment based on the user's remote operations. However, existing systems rely on complicated interfaces for the robot controllers and are difficult to apply to different types of robots. Additionally, the robot operation is limited to the area directly in front of the controllers. These features make the system location-restricted and difficult to use conveniently in everyday workplaces.
Table 1. Comparison of remote interactions between telepresence systems.
In this paper, we introduce a novel real-time 3D telepresence system designed for both physical interactions in remote locations and immersive 3D visual interactions between remote individuals, shown in Table 1d. The remote user is represented in two forms: a motion-retargeted humanoid robot and a digital human overlaid on the robot in a separate space. Equipped with our inertial sensor-based motion capture system, the remote user can conveniently control the humanoid robot and interact with remote objects through it. With the body-worn motion capture system, users can wirelessly operate the robot from any location, eliminating the need for complex robot control systems. An overview of our digital human-augmented robotic telepresence system is provided in Figure 1.
Rather than encountering a metallic robot that lacks realism, our system enhances the sense of reality by overlaying a digital human on the robot. The local user engages with the remote person’s digital avatar through a head-mounted display (HMD), facilitating immersive 3D visual interactions.
A key challenge in overlaying a digital human onto a humanoid robot is consistently estimating the robot’s full-body pose over time from the perspective of a moving person. Errors can arise from various factors, such as Visual SLAM (Simultaneous Localization and Mapping) inaccuracies in tracking the user’s head, occlusions of body parts, or parts of the robot being outside the camera’s view.
Visual SLAM is employed to track the user’s headset relative to the world space, with digital human augmentation occurring in the coordinate frame of the moving headset. While external knowledge of the target robot’s full-body pose can be provided, inaccuracies in the digital human overlay tend to worsen with increased head movement. Consequently, the accuracy of the digital human overlay is heavily dependent on the performance of the Visual SLAM system, leading to greater discrepancies during rapid head motions.
Previous approaches (e.g., [12,13]) rely on marker-based pose estimation to align the digital human avatar with the robot. However, these methods require robot-specific marker configurations and precise calibration, posing challenges when applied to different types of robots.
Our approach addresses the challenges posed by headset tracking errors by directly estimating the robot’s pose from images captured by moving head-worn cameras. This method allows for pose estimation that operates independently of runtime headset tracking, enabling our system to function without relying on external knowledge of the robot’s global pose in arbitrary environments.
Additionally, we employ a learning-based method for full-body pose estimation in humanoid robots that does not require markers. This method maintains the visual fidelity of the robot and can be adapted to any human-like robot, provided the appropriate dataset is available. As a result, it eliminates the need for marker calibration and mitigates issues related to occlusions or invisible markers.
We demonstrate our system in a remotely assisted physical training scenario, showcasing both immersive and physical telepresence capabilities. To support training and evaluation, we collected a new large-scale dataset for humanoid robot pose estimation. To the best of our knowledge, no existing dataset includes full-body pose annotations for humanoid robots [14]. We plan to make our dataset publicly available to contribute to the research community. In our experiments, our learning-based pose estimation runs in real time at 40 Hz on a standard PC, enabling marker-free alignment of digital humans and supporting immersive digital human-augmented robotic telepresence using only head-worn cameras in practical applications.
Our main contributions are summarized as follows:
A working, proof-of-concept prototype of a telepresence system that supports immersive 3D visual and physical interactions simultaneously between remote individuals.
A learning-based, marker-free, full-body pose estimation approach for humanoid robots using only moving head-worn cameras.
2. Related Work
Our research focuses on the intersection of three key areas: human avatars, telepresence, and humanoid robot pose estimation. Numerous studies have explored these areas, paving the way for advancements in our current research. This section provides a brief overview of relevant work, starting with foundational concepts and progressing towards recent developments.
2.1. Human Avatars and Telepresence
Human avatars play an increasingly prominent role in Virtual Reality (VR), telepresence, and human–computer interaction (HCI) applications [15,16,17,18]. Recent research emphasizes creating high-fidelity avatars [19,20] with enhanced realism and expressiveness for user control [21,22,23]. These advancements directly contribute to the goal of telepresence systems, which strive to create a strong sense of presence for the remote user [1,24,25,26,27,28,29]. Notable examples include Telesarphone, a telexistence communication system using retroreflective projection technology [27], and TELESAR V, which facilitates transferring bodily consciousness in telexistence [28]. Similarly, the Beaming system explores asymmetric telepresence, further enhancing user experience in remote environments [29]. Advancements in Augmented Reality (AR), Mixed Reality (MR), and Virtual Reality (VR) technologies further enhance the potential of telepresence systems. Studies like Holoportation [1] and ViGather [26] demonstrate the feasibility of utilizing these advancements to create realistic and immersive telepresence experiences.
Telepresence systems facilitate social interaction between remote individuals by providing virtual human avatars through head-worn displays, representing visual illusions of remote persons. This technology has applications in various fields, including remote consultations and conferences. Research efforts have explored the use of telepresence in these contexts, as evidenced by studies on teleconsultation [24], virtual conferences [25,26], clinical care [30], and telexistence communication [27,28].
However, virtual avatar-based telepresence systems often lack realism without supporting physical interactions. Building on this research, our system leverages XR technologies to enhance the sense of realism in telepresence, offering both visual and physical interactions simultaneously for a more immersive and engaging user experience.
2.2. Robotic Telepresence Systems
Studies on robotic telepresence systems represent an active area of research within the broader field of telepresence. Specifically, Mobile Robotic Telepresence (MRP) systems aim to enhance social interactions among individuals through the use of mobile robots, and they are gaining increasing traction in healthcare applications [4]. These telerobots allow users to navigate remote locations using a robot-mounted camera in real time, and previous studies indicate that such systems can improve the quality of life for individuals with special needs [5]. Early approaches focused on equipping robots with screens displaying the remote user's face, essentially enabling a robotic video call [31]. However, recent advancements have delved into more sophisticated interaction capabilities.
One such advancement involves enabling physical interactions between remote individuals through haptic devices. This technology allows users to experience the forces and sensations perceived by the robot, thereby enhancing the sense of immersion and realism. Studies by Schwarz et al. [32], Luo et al. [8], and Lenz et al. [33] explore the integration of haptic feedback into telepresence systems, demonstrating its potential to create more realistic and interactive user experiences.
The integration of the remote user's 3D avatar with the robot's physical structure enhances telepresence by providing a more lifelike representation of the remote user, thereby improving the sense of presence. Tejwani et al. [12] introduced an Avatar Robot System that overlays a 3D human model onto an upper-body robot through Augmented Reality (AR), leading to improved realism during robot interactions. Similarly, Jones et al. [13] overlaid a full-body human avatar onto a telerobot using checkerboard-based pose estimation, which improved user experiences in telepresence interactions. Both studies highlight the significant potential of digital human augmentation to enhance interaction and user satisfaction during telepresence.
Our research approach builds on these advancements by visualizing a digital avatar on a humanoid robot through head-worn displays during robot teleoperation. This integration not only enables the remote user to control the robot’s actions but also facilitates immersive visual interaction for the local user via the digital human-augmented humanoid robot. The combination of visual and physical interactions enhances the system’s realism, allowing the local user to feel more connected to the remote user’s actions and responses. This approach has not been thoroughly explored in previous studies on robotic telepresence, presenting a unique avenue for investigation.
2.3. Human Pose Estimation
Human pose estimation is closely related to humanoid robot pose estimation due to the similarities in joint structures and the shared goal of accurate pose estimation. Advances in deep learning have led to substantial improvements in human pose estimation [2,3,34,35,36,37,38,39]. For example, Insafutdinov et al. [35] and Cao et al. [36] significantly improved real-time multi-person 2D pose estimation using Convolutional Neural Networks (CNNs).
Recent approaches in 3D human pose estimation have demonstrated further advances by utilizing parametric body models [38,40], such as SMPL [41]. By fitting these body models onto images, anatomical constraints can be applied, leading to significantly improved accuracy compared to methods based on individual joint detection. However, inferring 3D poses from 2D representations introduces inherent pose ambiguities, which pose challenges in estimating diverse poses from images, particularly in cases of side views or self-occlusions.
To mitigate visibility issues, body-worn sensors such as Inertial Measurement Units (IMUs) have been introduced for pose estimation. These inertial sensors can measure the orientations of body segments regardless of camera visibility. von Marcard et al. [42] achieved promising results in full-body 3D pose estimation using only sparse inertial sensors. Huang et al. [43] extended this approach to real time by utilizing Recurrent Neural Networks (RNNs). However, relying solely on sparse sensors can lead to issues such as IMU drift and pose ambiguities.
Combining visual and inertial sensors can improve pose estimation, even in side-view and occlusion scenarios [44]. However, requiring numerous sensors can be burdensome for users and impractical for everyday applications.
Learning-based human pose estimation methods require large-scale pose datasets for training [45,46]. However, directly applying these techniques to humanoid robots presents challenges due to the lack of large-scale datasets for humanoid robots. Additionally, adaptations are needed to effectively apply learning-based pose estimation to humanoid robots.
2.4. Humanoid Robot Pose Estimation
Due to the distinct characteristics of humans and humanoid robots, directly applying human pose estimation methods to humanoids is impractical: outfitting robots with wearable sensors and integrating them into a network would impose additional burdens in typical applications. Our approach instead addresses cases where the robot's appearance may vary, focusing on general applicability as long as the robot's topology—such as one head, two arms, two legs, and a body with roughly human proportions—resembles that of a human being.
Miseikis et al. [47] presented a method utilizing convolutional neural networks (CNNs) to detect robot joints. Lee et al. [48] extracted robot keypoints from a single image without the need for markers or offline calibration. Lu et al. [49] proposed an algorithm for automatically determining the optimal keypoints for various parts of a robot. Traditional robot pose estimation, however, often focuses on robotic arms rather than full humanoid forms. Furthermore, since robot joints differ from human joints, conventional methods of converting robot poses into human poses are not feasible.
Given the differences between humans and humanoid robots, training models for humanoid robot pose estimation requires a dataset different from those used for human pose estimation. However, due to the scarcity of such datasets, we have developed our own dataset for training purposes. This initiative aims to overcome the limitations of existing research and pave the way for more accurate and adaptable humanoid pose estimation techniques.
3. Digital Human-Augmented Telepresence System
To achieve convenient immersive remote working, our goal is to develop a telepresence system that supports both physical and digital avatars simultaneously. The physical avatar is realized through a humanoid robot operated by a remote user, enabling physical interactions in remote places. The digital avatar is represented as a digital human, providing visual immersion of the remote person by overlaying the digital human onto the robot. An overview of our humanoid robot-aided 3D telepresence system is shown in Figure 2.
Our prototype system comprises two main components located in separate spaces. A user in a remote location, as shown on the left in Figure 2, can perform tasks in the local space through a humanoid robot. A local user, as shown on the right in Figure 2, can physically interact with this robot operated by the remote user and observe the immersive visual avatar of the remote person through a head-mounted display (HMD). Both users are connected wirelessly to a single machine that transmits full-body poses in real time.
At the remote site, the user wears inertial sensors for motion capture and an HMD for visual communication with the local person. The captured motion of the remote user is transmitted wirelessly to both the humanoid robot and the machine for pose estimation of the digital avatar in the target space. The humanoid robot and the digital avatar are deformed based on the received motion through forward kinematics.
The local user is equipped with an HMD for visual communication, augmented with a stereo camera rig to observe the local space. Given the stereo rig imagery, our proposed learning-based robot pose estimation approach estimates the 3D pose of the observed robot in the local space. The estimation is wirelessly sent to the XR headset of the local user, and the HMD visualizes the digital avatar of the remote person in the estimated position and pose so that the visual avatar is overlaid on the humanoid robot from the local user's perspective for immersive 3D visual interactions. The details of the learning-based humanoid robot pose estimation method are described in Section 4. Additionally, the local user can physically interact with the remote person through the humanoid robot avatar.
3.1. Humanoid Robot
In the proposed telepresence system, the humanoid robot is utilized as a physical human avatar of the remote person for remote physical interactions. To achieve our objective, we implemented humanoid robots, categorized into three versions, as shown in Figure 3.
The first version of the dual-arm robot, shown in Figure 3a, is equipped with 14 degrees of freedom (DoF) for both arms, including 3D shoulders, 1D elbows, and 3D wrists, along with a camera mounted on the head in a fixed location. It employs an EtherCAT network with a 1 kHz communication cycle.
The second version of the humanoid robot, shown in Figure 3b, features a total of 19 DoF. It includes 8 DoF for both arms, comprising 3D shoulders and 1D elbows, and 10 DoF for both legs, consisting of 2D hips, 1D knees, wheels, and anchor supports. Additionally, it has 1 DoF in the waist and a camera on the head. This version uses EtherCAT with a 2 kHz communication cycle to improve stability in motion control.
The third version of the humanoid robot, shown in Figure 3c, is based on the same specifications as the version in Figure 3b, with the addition of the qb SoftHand [50] for both hands, each equipped with 1 DoF, enabling grasping motions to manipulate objects. The SoftHand employs RS485 network communication.
The first two versions of the humanoid robots are used solely for generating the Humanoid Robot Pose dataset, as described in Section 4.1. The third version is used for the demonstration of our remote physical therapy scenario and for the training and evaluation of the robot pose estimation method.
The current version of the robot is in a fixed location; however, we plan to enable mobility in the next version to showcase remote working in various outdoor scenarios.
3.2. Digital Avatar
In our telepresence system, the digital human model acts as a visual human avatar for the visual immersion of the remote user, enhancing the realism of remote work between separate locations. An example of the digital human model is shown in Figure 4.
The remote user's appearance is prescanned as a digital 3D model in our capture studio, which is equipped with tens of high-resolution cameras [2,3]. The prescanned T-posed 3D human model is transformed into an animatable full-body human model by manually adding rigging information using Blender software [51].
To accurately overlay the digital human model onto the humanoid robot, we adjust for differences in shape and height between the user and the humanoid robot. Using the Unified Robot Description Format (URDF) of our humanoid robot, we manually rescaled the prescanned rigged human model, as shown in Figure 4, improving the 3D alignment between the humanoid robot and the digital human model in the XR headset.
3.3. Robot Teleoperation in a Remote Space
To enable robot teleoperation, a remote user wears inertial sensors to control the robot and a head-mounted display (HMD) for video conferencing, as shown in Figure 5a. By using body-worn inertial sensors, our system avoids the need for complex interfaces typically required in existing teleoperation systems [6,7,8,9,10].
The remote user wears six inertial sensors (Xsens MTw Awinda [54]) on the upper arms, lower arms, and hands to capture upper-body motion. These measurements are wirelessly transmitted to a server, which relays the data to the humanoid robot in the local space. The robot continuously adapts its arm poses based on the remote user's movements.
The remote operator's pose is represented by the joint rotations of both arms, R_t, and a binary grab motion, g_t, for each hand. The body-worn inertial sensors are calibrated in the initial frames using the methods described in [3,43]: at the beginning, both the remote user and the humanoid robot hold a straight pose for a few seconds so that the inertial measurements are set to identity. At time t, the arm rotations R_t are then obtained from the calibrated inertial sensors.
During the calibration process, the sensors on the hands are also set to identity, with the assumption that both the user and the robot have their hands open. The binary grab motion indicator g_t is triggered when the relative rotation between the hand and lower-arm sensors exceeds a preset threshold, and the robot then opens or closes its hand based on this grab motion.
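As an illustration, the following Python sketch shows one way such a grab indicator can be derived from two calibrated IMU orientations; the quaternion convention, helper names, and the threshold value are assumptions for this example rather than the exact values used in our prototype.

```python
# Minimal sketch of the binary grab detection described above, assuming each
# calibrated IMU delivers its orientation as a quaternion (x, y, z, w).
# GRAB_ANGLE_DEG is an illustrative threshold, not the value used in the paper.
import numpy as np
from scipy.spatial.transform import Rotation as R

GRAB_ANGLE_DEG = 35.0  # hypothetical trigger threshold


def grab_indicator(hand_quat, forearm_quat, ref_hand_quat, ref_forearm_quat):
    """Return 1 (grab) if the hand has rotated relative to the lower arm
    beyond the threshold, compared with the open-hand calibration pose."""
    # Relative rotation of the hand w.r.t. the forearm at the current frame ...
    rel_now = R.from_quat(forearm_quat).inv() * R.from_quat(hand_quat)
    # ... and at calibration time (open hand, set to identity in practice).
    rel_ref = R.from_quat(ref_forearm_quat).inv() * R.from_quat(ref_hand_quat)
    # Angle of the residual rotation between the two relative poses.
    angle_deg = np.degrees((rel_ref.inv() * rel_now).magnitude())
    return int(angle_deg > GRAB_ANGLE_DEG)


if __name__ == "__main__":
    identity = np.array([0.0, 0.0, 0.0, 1.0])
    bent_hand = R.from_euler("x", 50, degrees=True).as_quat()
    print(grab_indicator(bent_hand, identity, identity, identity))  # -> 1
```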
For immersive visual interaction with the local environment, the remote user wears an Apple Vision Pro [52] for video conferencing. A GoPro camera [55] mounted on the humanoid robot's head shares its field of view with the remote user. Using Zoom [56], the remote user can observe the local space from the robot's perspective and engage in real-time conversations with the local individual. In this prototype, the wearable inertial sensors, the HMD, and the humanoid robot are wirelessly connected, making the system potentially operable in mobile scenarios.
3.4. Digital Human-Augmented Telepresence in a Local Space
To demonstrate the proof-of-concept prototype for digital human-augmented robotic telepresence, the local user wears camera sensors and an HMD for 3D telepresence observation, as shown in Figure 5b. Through this prototype, the local user can observe the digital human overlay representing the remote person while physically interacting with the humanoid robot.
The local user wears an XR headset (XReal Light [53]) equipped with a stereo camera rig for immersive telepresence interactions with the remote user. The stereo rig consists of synchronized miniature stereo cameras (160° FoV, 60 Hz) that provide a wide field of view, enabling full-body visibility of the robot even at close distances, while the high frame rate minimizes motion blur during fast movements. This setup enables immersive telepresence, allowing the user to maintain visual contact with the digital human model overlaid onto the humanoid robot, even during close interactions such as shaking hands (please refer to the Supplemental Video).
The stereo cameras are rigidly mounted on the XR headset, and their relative poses are precalibrated using a standard checkerboard calibration [57]. During operation, the stereo camera poses are updated based on the headset's motion, which is tracked using the built-in Visual SLAM system.
Using the stereo rig's observations, the 3D pose of the humanoid robot is estimated from the headset's perspective. Detailed humanoid robot pose estimation is covered in Section 4. The digital human model is then overlaid in the XR headset, aligned with the humanoid robot's estimated pose for real-time augmented telepresence.
Currently, the stereo rig is connected to the local server via a wired connection. In future iterations, we plan to implement a wireless connection for the stereo rig, similar to other wearable sensors, to ensure the system remains operable in mobile scenarios.
4. Humanoid Robot Pose Estimation Using Head-Worn Cameras
Working toward our goal of simultaneously supporting the immersive visual avatar and the physical avatar, our system overlays the remote user's digital human model onto the humanoid robot from the local user's perspective. For widespread usability, we propose a learning-based, markerless approach for 3D pose estimation of humanoid robots. Our learning-based method eliminates the need for attaching markers or performing calibration between them. Additionally, estimating from the moving user's perspective removes the requirement for calibration between the robot's global space and the observer's local space. As a result, our approach can be easily extended to other humanoid robots with appropriate pose datasets, and to outdoor, mobile environments. The full-body pose estimation pipeline for 2D and 3D is illustrated in Figure 6 and Figure 7, respectively.
Our method estimates the locations of joints on images, similar to learning-based human pose estimation approaches [58]. We collected a full-body humanoid robot dataset to train the regression-based convolutional neural network (CNN) [59]. To the best of our knowledge, this is the first dataset for full-body humanoid robots. The details of the dataset and 2D joint detection are described in Section 4.1 and Section 4.2, respectively.
The stereo cameras are calibrated with respect to the headset, whose pose is continuously updated via the built-in Visual SLAM. Using the tracked stereo camera poses, the locations of the 3D joints in global space are estimated by triangulating the detected 2D joints from the stereo rig. The details of the 3D pose estimation are described in Section 4.3.
Vision-based 3D pose estimation is not successful when parts of the humanoid robot are self-occluded or out of the image during close interactions between the user and the robot. Additionally, the estimated headset pose from the built-in Visual SLAM is inaccurate during fast head motions. To compensate for these error sources, we further update the 3D pose of the robot using measurements from the sparse inertial sensors worn by the remote person (only for arms). The inertial sensor measurements can be continuously captured even if body parts are occluded or out of the image. However, the sensor measurements are prone to drift over time, so we update the calibration in real time via the visual–inertial alignment described in Section 4.3. The IMU measurements are used for pose estimation in situations where body parts are unobserved.
The resulting estimated 3D pose of the robot in world space is continuously transmitted to the headset. The digital human model is deformed using this pose and is displayed immersively on the stereo headset display. This allows the local user to see the digital human avatar of the remote user overlaid on the humanoid robot.
4.1. Humanoid Robot Pose Dataset
For widespread acceptability, we decided to exploit markerless, learning-based pose estimation for humanoid robots with our prototype. However, none of the available humanoid robot datasets support full-body pose annotations [14], unlike human pose datasets [45]. Therefore, we collected a new Humanoid Robot Pose dataset for full-body support using the target humanoid robots shown in Figure 3.
We recorded various types of motions from different viewpoints using multiple cameras. For the training dataset, we collected 14 sequences for Robots V1, V2, and V3, as shown in Figure 3, using ten synchronized fixed cameras and three moving cameras. Each image sequence was uniformly subsampled for annotation. The joint locations of the full body were manually annotated using the annotation software [60].
The joint structure includes the nose, neck, shoulders, elbows, wrists, hips, knees, and ankles, for a total of 14 joints. The resulting training dataset of 15 k images is summarized in Table 2, which includes the dataset size for each robot type. Random background images, without any objects, were also used during training to determine the best configuration of the dataset.
For evaluation, we recorded two sequences using four multiview fixed cameras alongside our moving stereo camera rig, all synchronized in frame capture. The multiview fixed cameras were calibrated to establish a global coordinate system, and the 2D joint locations extracted from their images were used to generate ground-truth 3D joint sets. The camera poses of the stereo cameras, along with their images, served as input for estimating the 3D pose. The estimated 3D pose was then evaluated against the ground-truth 3D joint sets to assess performance.
Manually annotating 2D keypoints in all images is time-consuming, and the human-annotated locations may not be multiview consistent in 3D global space. Therefore, we generated the ground-truth data semi-automatically by training the pose detector separately, using both the training dataset and automatically generated data samples as follows.
Using the method described in Section 4.2, the initial pose detector was trained with the training dataset only. For the images from the four fixed cameras, the initial pose detector was utilized to extract raw 2D joint locations. Using the precalibration of the fixed cameras, initial 3D joint locations were extracted by triangulation from the four viewpoints. By projecting the 3D joints in global space to each fixed camera, globally consistent 2D joint locations were generated. Incorporating these generated samples with the training dataset, we retrained the pose detector to fit the newly added samples. We reprocessed the back-projection using the improved detector and generated more improved samples on the images from the fixed cameras. In the experiment, two rounds of training were sufficient to extract globally and temporally consistent 3D joint sets for evaluation.
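The following Python sketch outlines this bootstrapping loop. The training, detection, triangulation, and projection routines are passed in as placeholder callables standing for the components described in Section 4.2 and Section 4.3, so the snippet illustrates the procedure rather than our exact implementation.

```python
# Sketch of the semi-automatic ground-truth generation loop described above.
# The detector training, 2D detection, triangulation, and projection routines
# are injected as callables; they stand in for the real pipeline components.
def bootstrap_ground_truth(manual_samples, fixed_cam_images, train_fn,
                           detect_fn, triangulate_fn, project_fn, rounds=2):
    """fixed_cam_images: {cam_id: [image, ...]} from the calibrated fixed cameras."""
    detector = train_fn(manual_samples)          # round 0: manual labels only
    generated, joints_3d = [], []
    for _ in range(rounds):
        # 1) Detect raw 2D joints in every fixed-camera image.
        raw_2d = {cam: [detect_fn(detector, img) for img in imgs]
                  for cam, imgs in fixed_cam_images.items()}
        # 2) Triangulate per-frame 3D joints from the calibrated views.
        joints_3d = triangulate_fn(raw_2d)
        # 3) Reproject to every camera for multiview-consistent 2D labels.
        generated = [(cam, f, project_fn(joints_3d[f], cam))
                     for cam in fixed_cam_images
                     for f in range(len(joints_3d))]
        # 4) Retrain on manual + generated samples and repeat.
        detector = train_fn(manual_samples + generated)
    return detector, joints_3d
```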
Finally, the pose of the moving stereo camera rig was estimated using stereo Visual SLAM over time [61]. We evaluated the proposed method using the registered stereo images with the globally consistent 3D joint dataset. The resulting 2.8 k evaluation dataset is summarized in Table 3. Sequence 1 includes minimal camera movements, and Sequence 2 incorporates fast camera movements to measure performance in different scenarios.
4.2. CNN Joint Detector
To achieve our goal of markerless full-body pose estimation for humanoid robots, we utilize a convolutional neural network (CNN) trained on our collected dataset, as described in Section 4.1. Specifically, we employ a regression-based CNN [59] paired with a stacked hourglass backbone [58] for our joint detector.
The stacked hourglass architecture is highly regarded in the domain of human pose estimation, exhibiting performance metrics that are competitive with those of leading methodologies such as OpenPose [36] and ViTPose [62]. Notably, it has achieved state-of-the-art results on the MPII dataset [46], as highlighted in [63].
To enhance computational efficiency, we adopted the DSNT regression module, which allows the network to directly estimate 2D keypoint coordinates. This approach removes the necessity of parsing heatmaps for coordinate extraction during runtime. Additionally, we train the network similarly to the method described in [3].
4.2.1. Network Structure
The joint detection network takes a single image I as input and outputs the joint coordinates K along with their confidence maps H for J = 14 keypoints. We use four stacked stages to balance accuracy and speed. The structure of the multistage network is illustrated in Figure 6.
Figure 6. Network structure for the 2D joint detector: Given a single input image, the 4-stage network outputs the keypoint coordinates K and their confidence maps H. The Hourglass module [58] outputs unnormalized heatmaps H̃, which are propagated to the next stage. The DSNT regression module [59] normalizes H̃ and produces H and K.
In each stage, the Hourglass module [58] infers the unnormalized heatmaps H̃ for all joints. The DSNT regression module [59] normalizes H̃ into H using a softmax layer. H is then transformed into the 2D coordinates K by taking the dot product with the X and Y coordinate matrices.
The j-th joint confidence is read out from H at the estimated 2D coordinate K_j. When the confidence is large enough (above a fixed threshold), the joint is considered detected, and the joint visibility v_j is set to 1; otherwise, it is set to 0.
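For illustration, the following numpy sketch mirrors this DSNT-style readout: the unnormalized heatmaps are softmax-normalized, the 2D coordinates are obtained as expectations over normalized coordinate grids, and a confidence is read out at the predicted location. Array shapes and the detection threshold are illustrative assumptions, not the values used in our network.

```python
# Sketch of the DSNT-style readout used at each stage.
import numpy as np


def dsnt_readout(h_tilde, conf_thresh=0.1):
    """h_tilde: (J, H, W) unnormalized heatmaps -> (keypoints, confidences, visibilities)."""
    J, H, W = h_tilde.shape
    flat = h_tilde.reshape(J, -1)
    flat = np.exp(flat - flat.max(axis=1, keepdims=True))   # numerically stable softmax
    heatmaps = (flat / flat.sum(axis=1, keepdims=True)).reshape(J, H, W)

    # Normalized coordinate grids in [-1, 1], as in the DSNT formulation.
    xs = (2 * np.arange(W) + 1) / W - 1
    ys = (2 * np.arange(H) + 1) / H - 1
    kx = (heatmaps.sum(axis=1) * xs).sum(axis=1)             # E[x] per joint
    ky = (heatmaps.sum(axis=2) * ys).sum(axis=1)             # E[y] per joint
    keypoints = np.stack([kx, ky], axis=1)                   # (J, 2) in [-1, 1]

    # Confidence: normalized heatmap value at the predicted pixel (threshold illustrative).
    px = np.clip(((kx + 1) * W / 2).astype(int), 0, W - 1)
    py = np.clip(((ky + 1) * H / 2).astype(int), 0, H - 1)
    conf = heatmaps[np.arange(J), py, px]
    visibility = (conf > conf_thresh).astype(int)
    return keypoints, conf, visibility
```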
4.2.2. Network Optimization
The joint detector network is trained to minimize a total loss composed of a regression term L_J for visible joints and an invisibility term L_invis, similar to the method in [3].
To estimate 2D coordinates for the visible joints (those with v_j^gt = 1), the regression loss is applied as
L_J = Σ_j v_j^gt [ ‖K_j − K_j^gt‖ + JS( H_j ‖ G(K_j^gt, σ) ) ],
where the superscript gt represents the ground-truth data for training, and v_j^gt indicates the binary visibility of each joint in the ground truth. G(K_j^gt, σ) is a 2D Gaussian map drawn at the coordinates K_j^gt with a 2D standard deviation σ; a fixed σ is used for training. JS denotes the Jensen–Shannon divergence, which measures the similarity between the confidence map H_j and the Gaussian map generated from the ground-truth data [59].
To discourage false detections, the invisibility loss L_invis is applied to the unnormalized heatmaps H̃: for every invisible joint (v_j^gt = 0), it penalizes the magnitude of H̃_j, suppressing it toward a zero heatmap. A zero heatmap corresponds to a uniform distribution in H_j after normalization, which drives the joint confidence down for invisible joints.
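A minimal sketch of such a training loss is given below, assuming a Euclidean regression term plus a Jensen–Shannon term for visible joints and a squared penalty on the unnormalized heatmaps of invisible joints; the exact penalty form and weighting used in our implementation may differ.

```python
# Illustrative detector loss: regression + JS shaping for visible joints,
# squared suppression of unnormalized heatmaps for invisible joints.
import numpy as np

EPS = 1e-12


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (flattened heatmaps)."""
    p, q = p.ravel() + EPS, q.ravel() + EPS
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def gaussian_map(center, shape, sigma=1.0):
    """Normalized 2D Gaussian drawn at `center` (x, y) in pixel coordinates."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    g = np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma ** 2))
    return g / (g.sum() + EPS)


def detector_loss(K, H, H_tilde, K_gt, vis_gt, sigma=1.0):
    """K: (J,2) predicted pixel coords, H: (J,h,w) normalized heatmaps,
    H_tilde: (J,h,w) unnormalized heatmaps, K_gt: (J,2), vis_gt: (J,) in {0,1}."""
    loss = 0.0
    for j in range(K.shape[0]):
        if vis_gt[j]:  # visible joint: coordinate regression + heatmap shaping
            loss += np.linalg.norm(K[j] - K_gt[j])
            loss += js_divergence(H[j], gaussian_map(K_gt[j], H[j].shape, sigma))
        else:          # invisible joint: suppress the unnormalized heatmap
            loss += np.mean(H_tilde[j] ** 2)
    return loss
```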
4.2.3. Training Details
The network is trained in multiple stages. First, it is pretrained on the COCO 2017 Human Pose dataset [45] to learn low-level texture features. Then, it is further trained on our custom dataset, as described in Section 4.1. Intermediate supervision is applied at each stage of the network to prevent vanishing gradients and ensure effective learning [36].
To improve detection robustness, several data augmentation techniques were employed. These include random image transformations such as scaling, rotation, translation, and horizontal flipping. Additionally, random color adjustments, including contrast and brightness changes, were introduced to account for varying lighting conditions. Based on empirical observations, random blurring and noise were also added to address the complexity of robot images, which often include intricate textures such as cables, electronic components, and body covers with bolts and nuts.
The four-stage stacked hourglass network has 13.0 M trainable parameters and is trained for 240 epochs using the RMSprop optimizer. On a machine with an Intel Xeon Silver 4310 processor, 128 GB of RAM, and dual NVIDIA RTX 4090 GPUs, training takes approximately 10.5 h.
4.3. Sensor Fusion for 3D Pose Estimation
The full-body pose estimation of the humanoid robot from the local user's perspective takes three inputs: the partial joint rotations (both arms) of the remote user acquired from body-worn inertial sensors, the headset pose from the XR glasses of the local user, and the stereo images from the camera rig mounted on the headset, as shown in Figure 2 and Figure 5. At runtime, these heterogeneous sensor measurements are utilized to localize and deform the digital human, and the resulting estimated pose provides the local user with an XR overlay on the humanoid robot. The full-body pose of the observed humanoid robot is estimated in six stages, as illustrated in Figure 7.
Figure 7. Full-body 3D pose estimation pipeline: Heterogeneous sensor data from the XR headset and stereo camera rig worn by the local user, combined with inertial sensors worn by the remote person, are fused to estimate the full-body pose of the digital human from the local user's perspective.
4.3.1. Stereo 2D Keypoints Inference
In the first stage, the stereo camera views, denoted as I^L and I^R, mounted on the XR headset are processed simultaneously by a single joint detection network. The keypoint detector estimates the humanoid robot's keypoints, K^L and K^R, from the left and right stereo images, respectively (see Figure 7).
The joint detection network with four stacked stages requires 48.66 giga floating-point operations (GFLOPs) per inference. The performance of the joint detector is summarized in Table 4. In our experiments, keypoint detection on the stereo images achieved an average processing speed of 45 FPS.
4.3.2. Stereo Camera Pose Estimation
The poses of the stereo cameras relative to the headset, T_HL and T_HR, are precalibrated as described in Section 3.4. In the second stage, the moving camera poses, T_L(t) and T_R(t), are estimated from the headset pose T_H(t) tracked by the built-in Visual SLAM, i.e., T_L(t) = T_H(t) · T_HL and T_R(t) = T_H(t) · T_HR at time t. We utilize the SLAM coordinate system as the global reference frame.
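The per-frame update is a simple composition of rigid transforms, as in the following sketch; the 4 × 4 homogeneous-matrix representation and variable names are assumptions chosen for illustration.

```python
# Sketch of the per-frame stereo camera pose update, assuming 4x4 homogeneous
# transforms: T_world_headset from the headset's Visual SLAM and the
# precalibrated headset-to-camera offsets from the checkerboard calibration.
import numpy as np


def stereo_camera_poses(T_world_headset, T_headset_camL, T_headset_camR):
    """Return world poses of the left/right cameras at the current frame."""
    T_world_camL = T_world_headset @ T_headset_camL
    T_world_camR = T_world_headset @ T_headset_camR
    return T_world_camL, T_world_camR


if __name__ == "__main__":
    T_hs = np.eye(4); T_hs[:3, 3] = [0.0, 1.6, 0.0]      # headset 1.6 m above origin
    T_L = np.eye(4);  T_L[:3, 3] = [-0.03, 0.0, 0.0]     # left camera 3 cm to the left
    T_R = np.eye(4);  T_R[:3, 3] = [+0.03, 0.0, 0.0]
    print(stereo_camera_poses(T_hs, T_L, T_R)[0][:3, 3])  # -> [-0.03  1.6  0.]
```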
4.3.3. Three-Dimensional Joints Estimation
In the third stage, the 3D joint positions in global space, P_t, are estimated using the 2D keypoints (K^L and K^R) and the stereo camera poses (T_L(t) and T_R(t)). To compute the 3D joint locations of the humanoid robot in world space, we determine, for each joint, the intersection of the ray through K^L from the left camera with the ray through K^R from the right camera. We apply the standard Direct Linear Transformation (DLT) algorithm to calculate these intersections, as detailed in [57].
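A minimal sketch of this triangulation step for a single joint is shown below, assuming 3 × 4 projection matrices for the two cameras; looping over the 14 joints yields the full set P_t.

```python
# Minimal DLT triangulation sketch for one joint observed in both stereo views.
# P_L and P_R are 3x4 projection matrices (K [R|t]) of the left/right cameras
# at the current frame; x_L, x_R are the detected 2D keypoints in pixels.
import numpy as np


def triangulate_dlt(P_L, P_R, x_L, x_R):
    """Return the 3D point (in world space) minimizing the algebraic DLT error."""
    A = np.stack([
        x_L[0] * P_L[2] - P_L[0],
        x_L[1] * P_L[2] - P_L[1],
        x_R[0] * P_R[2] - P_R[0],
        x_R[1] * P_R[2] - P_R[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # dehomogenize
```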
4.3.4. Global Pose Alignment
In the fourth stage, the pose of the digital human's root joint is estimated in global space using the 3D joint locations P_t. The root transformation T_root, comprising the global rotation R_root and the global translation t_root of the digital human, is estimated by minimizing the joint position errors between the T-posed initial joint locations P0 of the digital human and the current estimated 3D joint positions P_t:
T_root = argmin_T Σ_{j ∈ S} ‖ T · P0_j − P_t,j ‖²,
where S is the set of stationary torso joints, including the shoulders, hips, neck, and nose. T_root is estimated using the standard point-to-point ICP solver [64].
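Because the joint correspondences are known (the same joint index on both point sets), this alignment can also be solved in closed form with a single Kabsch step, as in the sketch below; it is shown as an illustrative alternative to the standard ICP solver used in our pipeline.

```python
# Sketch of the root alignment step: rigidly align the T-posed torso joints of
# the digital human to the triangulated torso joints with known correspondences.
import numpy as np


def align_root(P_tpose, P_obs):
    """P_tpose, P_obs: (N,3) corresponding torso joints. Returns (R, t) with
    P_obs ~= R @ P_tpose + t."""
    mu_s, mu_t = P_tpose.mean(axis=0), P_obs.mean(axis=0)
    H = (P_tpose - mu_s).T @ (P_obs - mu_t)                       # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_t - R @ mu_s
    return R, t
```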
4.3.5. Visual–Inertial Alignment for Bone Orientations
In the fifth stage, the limb poses are estimated by temporally aligning the 3D joint locations of the local humanoid robot with the inertial measurements received from the remote person.
Although the inertial sensors are initially calibrated, as described in Section 3.3, they tend to drift over time. This drift can result in misalignments between the poses estimated from the remote user's inertial measurements and those observed from the stereo images of the local humanoid robot.
Since the 3D joint estimation method using stereo images (as described in Section 4.3.3) operates on a per-frame basis, it can exhibit significant depth errors even with minimal pixel errors in the 2D keypoint locations. Figure 7 illustrates the depth errors of the estimated 3D joint locations caused by slight inaccuracies in the 2D keypoints. The joint locations (shown in red) should ideally reflect a straight pose, but instead display bent arms due to triangulation errors in the stereo 2D keypoints. This example demonstrates that such errors can be corrected by applying the corrected inertial measurements (shown in yellow) through visual–inertial alignment.
To reduce the misalignment between the inertial measurements and the visual observations, we employ the visual–inertial alignment method proposed in [3]. This method extracts the bone directions d_IMU from the inertial sensors and the bone directions d_vis from the estimated 3D joint locations P_t in world space whenever the corresponding joints are detected. To mitigate the misalignment, the IMU offset rotation R_off is estimated so that all inertial bone directions d_IMU align with the corresponding visual bone directions d_vis.
A bone direction, d_j, represents the directional vector from the location of the base (parent) joint to the tip (child) joint. It can be computed using either the inertial measurements or the 3D joint locations as
d_IMU,j = R_IMU,j · e_y,   d_vis,j = (P_t,j − P_t,par(j)) / ‖P_t,j − P_t,par(j)‖,
where R_IMU,j is the rotation matrix acquired from the calibrated body-worn inertial sensor, e_y = (0, 1, 0)^T selects the second column vector of the rotation matrix, and par(j) indicates the parent index of joint j. Note that d_IMU,j and d_vis,j are represented in the digital human's local coordinate space, obtained by canceling out the root transformation T_root.
The IMU offset matrix R_off can be estimated by solving the least-squares problem
R_off = argmin_R Σ_j ‖ R · d_IMU,j − d_vis,j ‖².
R_off is computed using the online k-means algorithm, with recent measurements given more weight than past ones to react to sudden sensor drift [3].
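This is an instance of the orthogonal Procrustes (Wahba) problem on unit vectors and admits a closed-form SVD solution, sketched below; the recency weighting used in our online update is only hinted at through the optional weights argument.

```python
# Sketch of the visual-inertial offset estimation: find one rotation R_off that
# maps the IMU-derived bone directions onto the visually triangulated ones.
import numpy as np


def estimate_imu_offset(d_imu, d_vis, weights=None):
    """d_imu, d_vis: (N,3) unit bone directions. Returns R_off with
    d_vis ~= R_off @ d_imu."""
    w = np.ones(len(d_imu)) if weights is None else np.asarray(weights)
    B = (w[:, None] * d_vis).T @ d_imu                 # 3x3 attitude profile matrix
    U, _, Vt = np.linalg.svd(B)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # proper rotation only
    return U @ D @ Vt
```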
4.3.6. Full-Body Motion Data
The full-body pose of the digital human, Θ_t, is represented as a concatenation of the global position t_root, the rotations of the stationary bones, and the rotations of the visually aligned moving limb bones, all in the world space of the local person. In the sixth stage, the full-body motion data Θ_t are updated for network transmission to the XR headset of the local user. The stationary bone rotations are computed by composing the global root rotation R_root with the corresponding initial T-pose bone rotations of the digital human model, and the limb bone rotations are computed by composing the drift-corrected inertial rotations R_off · R_IMU,j with those T-pose rotations.
4.3.7. XR Visualization
At runtime, the digital human is animated over time according to the estimated 3D pose. The posed digital human is visualized through the XR headset to provide digital human augmentation. Instead of wirelessly transmitting the entire surface of the digital human every frame, only the much smaller full-body pose data are sent between the pose estimation server and the XR headset. This approach ensures efficient network usage for the digital human's deformation over time.
At runtime, the estimated full-body pose of the digital human, Θ_t, is continuously sent to the XR headset of the local user. Within the XR headset, the digital human is deformed based on Θ_t and visualized at the global position t_root, achieving a real-time XR overlay of the digital human onto the humanoid robot. In experiments, the full-body pose estimation process, including CNN joint detection, achieves an average frame rate of 40 FPS.
4.3.8. Implementation Details
The 3D joint location estimation operates on a per-frame basis, as described in Section 4.3.3. In the current setup with humanoid robot V3, shown in Figure 3c, the robot's root transformation remains stationary in world space, and its lower-body pose is fixed. These relaxed constraints are applied to the 3D pose estimation solver, enabling the inclusion of temporal information.
When estimating the 3D joint positions P_t, the knees and ankles are additionally included in the stationary joint set S. Online-averaged 3D joint locations are computed for these fixed joints, assuming they stay stationary in world space. This constraint reduces motion jitter, allowing the root transformation T_root to converge rapidly, typically within a hundred frames. Future work aims to extend this method to walkable humanoid robots by applying temporal optimization techniques from [3].
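As a simple illustration of such online averaging, the sketch below uses an exponential moving average per stationary joint; the actual averaging scheme and its parameters in our prototype may differ.

```python
# Simple stand-in for the online averaging of stationary joints: an exponential
# moving average per joint, which damps per-frame triangulation jitter.
import numpy as np


class StationaryJointFilter:
    def __init__(self, alpha=0.05):
        self.alpha = alpha          # smaller alpha -> stronger smoothing
        self.mean = None            # (N_stationary, 3)

    def update(self, joints_3d):
        """joints_3d: (N_stationary, 3) current triangulated positions."""
        joints_3d = np.asarray(joints_3d, dtype=float)
        if self.mean is None:
            self.mean = joints_3d.copy()
        else:
            self.mean = (1 - self.alpha) * self.mean + self.alpha * joints_3d
        return self.mean
```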
The full-body pose estimation method is designed to handle occlusions and out-of-frame challenges. When limb joints are not visible in the stereo images, the corresponding 2D keypoints may be missing, causing failures in the 3D joint estimation detailed in Section 4.3.3. However, the visual–inertial alignment described in Section 4.3.5 still provides limb orientation estimates, enabling continued operation despite partial occlusions.
In contrast, the system encounters failure if torso joints, such as the shoulders or hips, are unobserved. The absence of these joints prevents estimation of the root transformation T_root, since the ICP solver in Section 4.3.4 then lacks sufficient correspondences.
5. Results and Evaluation
In this section, we present evaluations of our humanoid robot 3D pose estimation method for real-time 3D telepresence. Additionally, we showcase results for a potential application in remote physical training, demonstrating how our telepresence system provides immersive visual interactions and physical manipulations in remote work scenarios.
5.1. Ablation Study on 2D Joint Detectors for Humanoid Robots
In this section, we evaluate the performance of our CNN-based joint detector (described in Section 4.2) using the HRP training dataset. The detector was trained on this dataset and validated using the HRP validation dataset, as outlined in Table 2.
To ensure robust performance, the joint detector was trained using various dataset configurations: images from the V3 robot only, images from all robot versions (V1–V3), and images with random backgrounds. These configurations are detailed in Table 4.
For evaluation, we use the Percentage of Correct Keypoints (PCK) as our metric, which is consistent with previous human pose estimation work [58,62,65,66]. Under this metric, a predicted joint is considered correct if it lies within a threshold fraction of a reference length from the ground truth; for the MPII dataset [46], the reference length is commonly tied to the torso diameter. For a stricter evaluation on our HRP validation dataset, performance was assessed using PCK across multiple thresholds (0.1, 0.05, 0.025, 0.0125, 0.00625), and the mean PCK (mPCK) was calculated across these thresholds.
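For reference, the following sketch computes PCK and mPCK in this form; the reference length (e.g., the torso diameter per sample) is supplied by the caller.

```python
# Sketch of the PCK/mPCK evaluation: a predicted joint counts as correct when
# its distance to the ground truth is below a threshold times a reference length.
import numpy as np

THRESHOLDS = (0.1, 0.05, 0.025, 0.0125, 0.00625)


def pck(pred, gt, ref_len, thresh):
    """pred, gt: (N_samples, J, 2) 2D joints; ref_len: (N_samples,) reference lengths."""
    dist = np.linalg.norm(pred - gt, axis=-1)               # (N, J)
    return float((dist < thresh * ref_len[:, None]).mean())


def mean_pck(pred, gt, ref_len, thresholds=THRESHOLDS):
    return float(np.mean([pck(pred, gt, ref_len, t) for t in thresholds]))
```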
The results of 2D pose estimation using different dataset configurations are summarized in Table 4. The evaluation shows that using images from all robot versions (V1–V3) provided slightly better performance compared with the other configurations, while adding random background images did not result in a performance improvement. Nonetheless, all configurations achieved over 97% accuracy under the PCK metric, corresponding to an average pixel error of 6.22 pixels.
For the remainder of the experiments, the joint detector trained on Dataset C (Robot V1 + V2 + V3), as shown in Table 4, is selected as the optimal detector.
Figure 8 shows the 2D pose estimation results from various scenarios using head-worn cameras. The results demonstrate that our joint detector effectively detects full-body joints, even during fast head motions, under significant motion blur, or when parts of the body are out of frame or at close distances.
5.2. Performance Evaluation of 3D Pose Estimation for Humanoid Robots
Our learning-based full-body 3D pose estimation method for humanoid robots is not directly comparable to any previous approaches we are aware of. Prior robot pose estimation methods [12,13] only work with specific types of robots with markers and do not function with generic full-body humanoid robots without markers. Learning-based humanoid robot pose estimation methods [14,49] are unable to estimate 2D keypoints for full-body joints. Prior full-body human pose estimation methods [3,67,68,69] do not work with humanoid robot imagery.
Our 3D pose estimation method, denoted as Ours, uses head-worn stereo views, head tracking, and inertial measurements from the upper arms and wrists to estimate the full-body pose for the XR overlay. We evaluate our method on the Humanoid Robot Pose (HRP) dataset (see Section 4.1) and compare it against the baseline approaches detailed in Section 5.2.1.
The evaluation is conducted in a lab environment under the same lighting conditions as the training set, with pose estimation performance assessed in world space (see Section 5.2.2) and in the head-worn camera space (see Section 5.2.3).
5.2.1. Baseline Approaches
The Motion Capture system, referred to as Mocap, serves as a baseline approach that assumes all relevant information is consistently provided over time. The humanoid robot's arm poses are given by motion capture data from the calibrated inertial sensors, as described in Section 3.3. The lower-body pose, root rotation, and root position are also provided from ground-truth (GT) data at the first frame. However, this strong assumption limits the method's applicability to specific robots and locations. In contrast, Ours requires no prior knowledge and can be applied to any type of robot or environment.
Smplify [40] is a single-image-based method that estimates the full-body pose using 2D keypoints and the SMPL body model [41]. For a fair comparison, we provide Smplify with the keypoints detected by our joint detector (see Section 5.1). However, Smplify's single-image approach lacks depth information, resulting in significant inaccuracies in the 3D body root position estimates and a lack of multiview consistency. To mitigate this, we replace Smplify's raw body root positions with ground-truth (GT) data (see Section 4.1). We denote the variant using the left stereo image with GT root positions as Smplify-L and the variant using the right stereo image with GT root positions as Smplify-R.
Our 3D pose estimation runs at an average of 40 FPS on a desktop PC equipped with an AMD Ryzen 9 5900X processor (3.70 GHz), 64 GB of RAM, and an NVIDIA GeForce RTX 3080 Ti GPU. In comparison, Smplify takes approximately 45 s to process a single view.
5.2.2. Evaluation in World Space
We conducted evaluations using the HRP dataset's evaluation set, as described in Table 3, which contains 3D humanoid robot motions under different headset movements. Sequence 1 features minimal camera movements, while Sequence 2 involves rapid camera movements. The estimated full-body 3D pose of each method is retargeted to the SMPL body model [41], and the results are overlaid from another viewpoint to demonstrate accuracy in 3D. The shape parameters of SMPL are estimated from the detected 3D joint positions using the method described in [3].
Figure 9 presents qualitative comparisons of full-body pose estimation in world space between Mocap, Smplify, and Ours. Each frame selected from Sequence 2 shows an overlaid image from a fixed world viewpoint for each method. For detailed comparative results over time, please refer to the Supplemental Video.
To assess the accuracy of our 3D pose estimation, we compared its performance with the baseline approaches using ground truth data, measuring 3D joint position and orientation errors in world space. The overall results for body part evaluations are presented in Table 5 and Table 6, with detailed average joint orientation and position errors for each bone in Sequence 1, Sequence 2, and overall.
In all sequences shown in Table 5, our method consistently achieves significantly lower joint orientation errors compared with the baseline approaches. While the Mocap method uses an initial calibration between the inertial sensors and the remote user in a straight pose, its arm orientation errors are notably higher than those of the lower body. This discrepancy is primarily due to the lack of real-time adjustments in Mocap, which leads to misalignments over time as the remote user and the local humanoid robot perform similar motions.
In contrast, the arm orientation errors of Ours are significantly reduced, thanks to the continuous calibration updates provided by the visual–inertial alignment process described in Section 4.3. This approach enables our method to outperform Mocap, especially in terms of upper-body joint orientation accuracy. Additionally, our method estimates the lower-body pose with an accuracy comparable to Mocap, despite not relying on external knowledge or ground-truth (GT) data for the root orientation and position. This achievement highlights the robustness of our system, as illustrated in Figure 9.
Smplify-L and Smplify-R exhibit substantially higher errors compared with Ours, primarily due to inaccurate root joint orientation estimation, especially when the robot is not facing forward or is off-center in the images. Since Smplify is mainly trained on human bodies for its shape and pose priors, these inaccuracies likely arise from significant differences between the joint placements of the humanoid robot and those of humans. Additionally, Smplify displays considerable motion jitter over time, whereas the temporal stability of Ours can be attributed to our 3D joint estimation method, which leverages temporal information, as described in Section 4.3.8.
In the evaluation of per-joint 3D position errors presented in Table 6, our method, which utilizes visual–inertial alignment, demonstrates significantly higher accuracy for the arm joints (elbows and wrists) compared to the other baseline approaches. However, for the torso and lower-body joints (neck, shoulders, hips, knees, and ankles), our method shows slightly lower accuracy than the baseline Motion Capture (Mocap) system. The accuracy of these joints is highly dependent on the precision of the root transformation estimated in Section 4.3.4. It is important to note that the root transformation of the Mocap system is derived from ground-truth (GT) data, whereas our method estimates the root without relying on any external information.
Overall, the results in world space show that the full-body pose estimation by our method, which utilizes our own joint detector, maintains multiview-consistent root joint estimations without using any prior knowledge, whereas Mocap and Smplify require substantial ground-truth data for the root joint.
5.2.3. Evaluation on Head-Worn Views
To evaluate the performance of the digital human overlay, we used the evaluation set of the HRP dataset, which includes 2D humanoid robot poses captured from moving stereo camera images across two test sequences. The estimated full-body 3D pose from each method is projected onto the head-worn camera’s viewpoint, with the results overlaid to assess the accuracy of the digital human overlay.
Figure 10 presents qualitative comparisons of full-body pose estimation in head-worn camera space between Mocap, Smplify, and Ours. Each frame corresponds to the frames in Figure 9, displaying the overlaid left image from the stereo rig. For detailed comparative results over time, please refer to the Supplemental Video.
To assess the accuracy of our 3D pose estimation for the digital human overlay, we compared its performance against the baseline approaches using ground truth data, measuring the Percentage of Correct Keypoints (PCK) in head-worn camera space. The overall results for body part evaluations are presented in Table 7, detailing the mean PCK for each joint in Sequence 1, Sequence 2, and overall. The mPCK was computed across eight PCK thresholds (0.03, 0.04, …, 0.1).
In the per-joint 2D position error evaluation in Table 7, Ours, utilizing visual–inertial alignment, demonstrates significantly higher accuracy for the arm joints (elbows and wrists) compared with the other baseline approaches. The 2D positions of the torso and lower body (neck, shoulders, hips, knees, and ankles) estimated by Ours show accuracy comparable to Mocap.
The results in the head-worn view space indicate that our method consistently outperforms the baseline approaches, delivering the best digital human overlay results without relying on any prior knowledge, while both Mocap and Smplify require substantial ground-truth data for an accurate digital human overlay.
5.3. Application
In this section, we demonstrate a remote physical training (PT) scenario in Extended Reality (XR) to showcase the capability of our 3D telepresence system for real-time remote visual and physical interactions. The detailed system setup is described in Section 3.3 and Section 3.4.
The remote trainer guides the local user through physical training motions interactively, providing immediate audio feedback to enhance performance. Our system enables the remote trainer to control the humanoid robot using only body-worn sensors, eliminating the need for complex manipulation via external controllers. With wireless body-worn sensors and an XR headset, motion capture becomes location-independent, allowing free movement in any environment.
Additionally, our prototype robot is capable of grasping and manipulating objects using the SoftHand, as described in Section 3.1. The remote user wears additional IMUs on both hands to control the robot's grasping motions. When the user's fingers grasp, the relative rotations between the wrist and hand sensors are detected (see Section 3.3).
Sample results of the remote robot's operations, including grasping motions, are shown in Figure 1 (bottom row, 1 × 3 image group). Further demonstrations of remote physical interactions, such as object handovers between the robot and the local user, can be viewed in the Supplemental Video.
The local user interacts visually with the remote trainer through the overlaid digital human on the robot, enhancing the sense of realism and immersion. Figure 11 presents selected frames of the digital human overlay as viewed through the head-mounted display worn by the local user. These results, captured during live demonstrations under lighting conditions different from those in the training data, highlight the system's ability to function effectively in various environments.
5.4. Network Efficiency
Instead of transmitting a posed high-quality surface of the digital human in every frame, our pose estimation system wirelessly sends a much smaller-sized pose data packet to the XR headset for visual augmentation onto the humanoid robot, as illustrated in
Figure 2 and
Figure 7. The size of the network packet is a key factor in overall system latency; keeping packets small ensures efficient network usage for pose data transmission.
For network transmission, the full-body motion data (as described in
Section 4.3.6) are encoded into a fixed-sized packet, which includes the pose of the upper body, a binary grab indicator for both hands (as described in
Section 3.3), and the root transformation. The pose of the lower body remains fixed in the current prototype due to relaxed constraints, as outlined in
Section 4.3.8, and is therefore excluded from the packet.
In summary, the packet comprises the upper body pose (four bones for each arm, represented as quaternions), the binary grab indicator for both hands (four labels stored in one byte), and the root transformation (a rotation quaternion and the 3D position). This results in a total packet size of 24 bytes.
The use of small, fixed-length packets enhances network efficiency. In our current prototype system, each packet is transmitted from the pose server to the headset using TCP socket networking.
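As an illustration of this pack-and-send pattern (it does not reproduce the prototype's exact 24-byte field layout), a fixed-length packet can be encoded and transmitted roughly as follows; the struct format, field widths, and server address are assumptions.

```python
import struct

# Illustrative fixed-length layout only; the prototype's actual 24-byte
# encoding is not specified here. This packet carries the root rotation
# (quaternion, 4 x float32), the root position (3 x float32), and one byte
# of per-hand grab flags, omitting the arm-bone quaternions for brevity.
PACKET_FMT = "<4f3fB"
PACKET_SIZE = struct.calcsize(PACKET_FMT)          # 29 bytes with this layout

def encode_packet(root_quat, root_pos, grab_flags):
    return struct.pack(PACKET_FMT, *root_quat, *root_pos, grab_flags & 0xFF)

packet = encode_packet((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 1.2), 0b01)
assert len(packet) == PACKET_SIZE

# Because every packet has the same length, the headset can read exactly
# PACKET_SIZE bytes per frame from the TCP stream (address is a placeholder):
# import socket
# with socket.create_connection(("pose-server.local", 5005)) as sock:
#     sock.sendall(packet)
```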
5.5. System Latency
Our prototype system utilizes various sensor data for real-time telepresence. The inertial sensor-based motion capture for the remote user is updated at 60 Hz, and the captured motion is transmitted wirelessly to the humanoid robot, which updates its pose at 120 Hz. Fast motions exhibit a delay of approximately 0.5 s in the robot’s teleoperation.
The remote operator uses video conferencing software [
56] to share the robot’s view and communicate with the local user. This also introduces a delay of about 0.5 s, which is the main source of latency in our system. In future versions, we plan to develop a dedicated video conferencing system to reduce this latency.
As the current XR headset [
53] cannot handle deep learning inference, we use an external processing machine in our setup. The pose estimation server receives motion capture data (60 Hz, wireless), stereo images (60 Hz, wired), and headset pose data (60 Hz, wireless) as input. The unsynchronized sensor data are processed whenever all three inputs are received by the server, with a frame offset of up to 16 ms in the current setup.
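A minimal sketch of this gating logic is shown below, assuming each sensor stream delivers its samples to a thread-safe queue; the queue names and the processing callback are hypothetical.

```python
import queue

# Hypothetical per-stream queues; the actual server receives IMU motion capture
# (60 Hz, wireless), stereo images (60 Hz, wired), and headset poses (60 Hz,
# wireless) over separate connections.
streams = {"mocap": queue.Queue(), "stereo": queue.Queue(), "headset": queue.Queue()}

def processing_loop(estimate_and_send_pose):
    """Run pose estimation whenever one sample from each input is available.
    No hard timestamp synchronization is enforced, which matches the tolerated
    inter-stream offset of up to ~16 ms."""
    pending = {}
    while True:
        for name, q in streams.items():
            if name not in pending:
                try:
                    pending[name] = q.get(timeout=0.005)   # wait briefly for a sample
                except queue.Empty:
                    pass                                   # stream not ready yet
        if len(pending) == len(streams):                   # all three inputs present
            estimate_and_send_pose(pending.pop("mocap"),
                                   pending.pop("stereo"),
                                   pending.pop("headset"))
```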
Our pose estimation system runs at 40 FPS (
Section 4.3) and continuously transmits the updated pose of the digital human to the XR headset. This external processing setup introduces a few frames of delay. In the next iteration, we plan to develop an on-device deep learning pose estimator, which we expect will significantly reduce wireless network delay.
6. Limitation and Future Work
The primary limitation of our approach lies in pose estimation inaccuracy, particularly for arm movements. The ray-triangulation-based 3D joint estimation described in Section 4.3.4 relies on accurate 2D keypoints from the stereo images; even small 2D errors can therefore produce inaccurate 3D joint estimates. To address this, we plan to incorporate temporally consistent joint optimization as proposed in [
3].
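To make this sensitivity concrete, the following sketch triangulates a joint as the midpoint between two back-projected stereo rays and perturbs one 2D keypoint by two pixels; the intrinsics and the 12 cm baseline are placeholder values, not our stereo rig's calibration.

```python
import numpy as np

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])                       # placeholder intrinsics
c_left = np.zeros(3)
c_right = np.array([0.12, 0.0, 0.0])                  # assumed 12 cm stereo baseline

def back_project(pixel, center):
    """Unit ray through a pixel for a camera at `center` looking along +z."""
    d = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    return center, d / np.linalg.norm(d)

def triangulate_midpoint(px_left, px_right):
    """3D point as the midpoint of the closest points on the two rays."""
    c1, d1 = back_project(px_left, c_left)
    c2, d2 = back_project(px_right, c_right)
    A = np.column_stack([d1, -d2])                    # solve c1 + t1*d1 ~= c2 + t2*d2
    t, *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    return 0.5 * ((c1 + t[0] * d1) + (c2 + t[1] * d2))

# At ~2 m depth, a 2-pixel error on one keypoint shifts the joint by ~10 cm.
p_clean = triangulate_midpoint((320.0, 240.0), (284.0, 240.0))
p_noisy = triangulate_midpoint((320.0, 240.0), (286.0, 240.0))
print(p_clean, np.linalg.norm(p_noisy - p_clean))
```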
Additionally, wireless delays in sensor measurements from VSLAM and IMUs further contribute to pose estimation inaccuracies, which can lead to misaligned XR overlay visualizations from the user’s perspective. In future versions, we aim to improve sensor synchronization to reduce system latency.
Our current system’s limitations highlight several areas for future research. One promising direction is developing an on-device pose estimator. The current XR headset lacks the capability to handle deep learning inference, necessitating an external machine and introducing network-related latencies. An on-device solution would enable the entire system, including deep learning inference, to run directly on the XR device, potentially leading to significant performance improvements.
Another area of interest is extending the approach to walking robots, which could enhance 3D telepresence and support remote work in outdoor environments. In future iterations, we plan to improve the mobility of the humanoid robot to enable remote working in a broader range of scenarios.
The current pose estimation method is designed to detect a single robot. Expanding this approach to support multiple remote working robots is another interesting direction for future research.
7. Conclusions
We introduced a real-time 3D telepresence system utilizing a humanoid robot as a step toward XR-based remote working. Our system facilitates physical interaction through the humanoid robot while providing visual immersion via an overlaid digital human avatar. To address the challenge of consistently synchronizing avatars, we proposed a markerless 3D pose estimation method specifically designed for humanoid robots, leveraging our newly collected dataset.
Our results demonstrate the robustness and consistency of this method in estimating full-body poses for humanoid robots using head-worn cameras, without relying on external knowledge such as the robot’s global position or full-body pose. We showcased the system’s potential through an application in remote physical training, highlighting the effectiveness of simultaneous visual and physical interactions using an XR headset.
We envision our prototype evolving into a fully mobile system, featuring an XR headset integrated with an on-device deep learning pose estimator. This development would eliminate the need for an external machine and significantly reduce system latency. Future iterations will aim to support both mobile and multiple humanoid robots, enhancing the utility and productivity of our telepresence prototype across various remote working tasks.
Author Contributions
Conceptualization, Y.L., H.L. and Y.C. (Youngwoon Cha); Methodology, Y.C. (Youngdae Cho) and Y.C. (Youngwoon Cha); Software, Y.C. (Youngdae Cho), W.S., J.B. and Y.C. (Youngwoon Cha); Validation, Y.C. (Youngdae Cho), H.L. and Y.C. (Youngwoon Cha); Formal analysis, Y.C. (Youngdae Cho) and Y.C. (Youngwoon Cha); Investigation, Y.C. (Youngdae Cho), W.S., J.B. and Y.C. (Youngwoon Cha); Resources, Y.L., H.L. and Y.C. (Youngwoon Cha); Data curation, Y.C. (Youngdae Cho), W.S. and J.B.; Writing—original draft, Y.C. (Youngdae Cho), J.B. and Y.C. (Youngwoon Cha); Writing—review & editing, Y.C. (Youngwoon Cha); Visualization, Y.C. (Youngdae Cho); Supervision, Y.L., H.L. and Y.C. (Youngwoon Cha); Project administration, Y.C. (Youngwoon Cha); Funding acquisition, Y.C. (Youngwoon Cha). All authors have read and agreed to the published version of the manuscript.
Funding
This work was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00214511), by Korea Institute of Science and Technology (KIST) Institutional Program (2E33003 and 2E33000), and by Institute of Information & communications Technology Planning & Evaluation (IITP) under the metaverse support program to nurture the best talents (IITP-2024-RS-2023-00256615) grant funded by the Korea government (MSIT).
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Orts-Escolano, S.; Rhemann, C.; Fanello, S.; Chang, W.; Kowdle, A.; Degtyarev, Y.; Kim, D.; Davidson, P.L.; Khamis, S.; Dou, M.; et al. Holoportation: Virtual 3d teleportation in real-time. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, Tokyo, Japan, 16–19 October 2016; pp. 741–754. [Google Scholar]
- Cha, Y.W.; Price, T.; Wei, Z.; Lu, X.; Rewkowski, N.; Chabra, R.; Qin, Z.; Kim, H.; Su, Z.; Liu, Y.; et al. Towards fully mobile 3D face, body, and environment capture using only head-worn cameras. IEEE Trans. Vis. Comput. Graph. 2018, 24, 2993–3004. [Google Scholar] [CrossRef]
- Cha, Y.W.; Shaik, H.; Zhang, Q.; Feng, F.; State, A.; Ilie, A.; Fuchs, H. Mobile. Egocentric Human Body Motion Reconstruction Using Only Eyeglasses-mounted Cameras and a Few Body-worn Inertial Sensors. In Proceedings of the 2021 IEEE Virtual Reality and 3D User Interfaces (VR), Lisboa, Portugal, 27 March–1 April 2021; pp. 616–625. [Google Scholar]
- Kristoffersson, A.; Coradeschi, S.; Loutfi, A. A review of mobile robotic telepresence. Adv. Hum.-Comput. Interact. 2013, 2013, 902316. [Google Scholar] [CrossRef]
- Zhang, G.; Hansen, J.P. Telepresence robots for people with special needs: A systematic review. Int. J. Hum.-Comput. Interact. 2022, 38, 1651–1667. [Google Scholar] [CrossRef]
- Aymerich-Franch, L.; Petit, D.; Ganesh, G.; Kheddar, A. Object touch by a humanoid robot avatar induces haptic sensation in the real hand. J. Comput.-Mediat. Commun. 2017, 22, 215–230. [Google Scholar] [CrossRef]
- Bremner, P.; Celiktutan, O.; Gunes, H. Personality perception of robot avatar tele-operators. In Proceedings of the 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Christchurch, New Zealand, 7–10 March 2016; pp. 141–148. [Google Scholar]
- Luo, R.; Wang, C.; Schwarm, E.; Keil, C.; Mendoza, E.; Kaveti, P.; Alt, S.; Singh, H.; Padir, T.; Whitney, J.P. Towards robot avatars: Systems and methods for teleinteraction at avatar xprize semi-finals. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 7726–7733. [Google Scholar]
- Khatib, O.; Yeh, X.; Brantner, G.; Soe, B.; Kim, B.; Ganguly, S.; Stuart, H.; Wang, S.; Cutkosky, M.; Edsinger, A.; et al. Ocean one: A robotic avatar for oceanic discovery. IEEE Robot. Autom. Mag. 2016, 23, 20–29. [Google Scholar] [CrossRef]
- Hauser, K.; Watson, E.N.; Bae, J.; Bankston, J.; Behnke, S.; Borgia, B.; Catalano, M.G.; Dafarra, S.; van Erp, J.B.; Ferris, T.; et al. Analysis and perspectives on the ana avatar xprize competition. Int. J. Soc. Robot. 2024, 1–32. [Google Scholar] [CrossRef]
- Double. Available online: https://www.doublerobotics.com/ (accessed on 25 September 2024).
- Tejwani, R.; Ma, C.; Bonato, P.; Asada, H.H. An Avatar Robot Overlaid with the 3D Human Model of a Remote Operator. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 7061–7068. [Google Scholar] [CrossRef]
- Jones, B.; Zhang, Y.; Wong, P.N.; Rintel, S. Belonging there: VROOM-ing into the uncanny valley of XR telepresence. Proc. ACM Hum.-Comput. Interact. 2021, 5, 1–31. [Google Scholar] [CrossRef]
- Amini, A.; Farazi, H.; Behnke, S. Real-time pose estimation from images for multiple humanoid robots. In RoboCup 2021: Robot World Cup XXIV; Alami, R., Biswas, J., Cakmak, M., Obst, O., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 91–102. [Google Scholar]
- Latoschik, M.E.; Roth, D.; Gall, D.; Achenbach, J.; Waltemate, T.; Botsch, M. The effect of avatar realism in immersive social virtual realities. In Proceedings of the 23rd ACM Symposium on Virtual Reality Software and Technology, Gothenburg, Sweden, 8–10 November 2017; pp. 1–10. [Google Scholar]
- Choi, Y.; Lee, J.; Lee, S.H. Effects of locomotion style and body visibility of a telepresence avatar. In Proceedings of the 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), Atlanta, GA, USA, 22–26 March 2020; pp. 1–9. [Google Scholar]
- Aseeri, S.; Interrante, V. The Influence of Avatar Representation on Interpersonal Communication in Virtual Social Environments. IEEE Trans. Vis. Comput. Graph. 2021, 27, 2608–2617. [Google Scholar] [CrossRef]
- Fribourg, R.; Argelaguet, F.; Lécuyer, A.; Hoyet, L. Avatar and sense of embodiment: Studying the relative preference between appearance, control and point of view. IEEE Trans. Vis. Comput. Graph. 2020, 26, 2062–2072. [Google Scholar] [CrossRef]
- Liao, T.; Zhang, X.; Xiu, Y.; Yi, H.; Liu, X.; Qi, G.J.; Zhang, Y.; Wang, X.; Zhu, X.; Lei, Z. High-Fidelity Clothed Avatar Reconstruction From a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 8662–8672. [Google Scholar]
- Zhao, X.; Wang, L.; Sun, J.; Zhang, H.; Suo, J.; Liu, Y. HAvatar: High-fidelity Head Avatar via Facial Model Conditioned Neural Radiance Field. ACM Trans. Graph. 2023, 43, 1–16. [Google Scholar] [CrossRef]
- Thies, J.; Zollhöfer, M.; Nießner, M.; Valgaerts, L.; Stamminger, M.; Theobalt, C. Real-time expression transfer for facial reenactment. ACM Trans. Graph. 2015, 34, 183-1. [Google Scholar] [CrossRef]
- Shen, K.; Guo, C.; Kaufmann, M.; Zarate, J.J.; Valentin, J.; Song, J.; Hilliges, O. X-Avatar: Expressive Human Avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 16911–16921. [Google Scholar]
- Gafni, G.; Thies, J.; Zollhofer, M.; Niessner, M. Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 8649–8658. [Google Scholar]
- Yu, K.; Gorbachev, G.; Eck, U.; Pankratz, F.; Navab, N.; Roth, D. Avatars for Teleconsultation: Effects of Avatar Embodiment Techniques on User Perception in 3D Asymmetric Telepresence. IEEE Trans. Vis. Comput. Graph. 2021, 27, 4129–4139. [Google Scholar] [CrossRef] [PubMed]
- Panda, P.; Nicholas, M.J.; Gonzalez-Franco, M.; Inkpen, K.; Ofek, E.; Cutler, R.; Hinckley, K.; Lanier, J. Alltogether: Effect of avatars in mixed-modality conferencing environments. In Proceedings of the 1st Annual Meeting of the Symposium on Human-Computer Interaction for Work, Durham, NH, USA, 8–9 June 2022; pp. 1–10. [Google Scholar]
- Qiu, H.; Streli, P.; Luong, T.; Gebhardt, C.; Holz, C. ViGather: Inclusive Virtual Conferencing with a Joint Experience Across Traditional Screen Devices and Mixed Reality Headsets. Proc. ACM Hum.-Comput. Interact. 2023, 7, 1–27. [Google Scholar] [CrossRef]
- Tachi, S.; Kawakami, N.; Nii, H.; Watanabe, K.; Minamizawa, K. Telesarphone: Mutual telexistence master-slave communication system based on retroreflective projection technology. SICE J. Control Meas. Syst. Integr. 2008, 1, 335–344. [Google Scholar] [CrossRef]
- Fernando, C.L.; Furukawa, M.; Kurogi, T.; Kamuro, S.; Minamizawa, K.; Tachi, S. Design of TELESAR V for transferring bodily consciousness in telexistence. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 5112–5118. [Google Scholar]
- Steed, A.; Steptoe, W.; Oyekoya, W.; Pece, F.; Weyrich, T.; Kautz, J.; Friedman, D.; Peer, A.; Solazzi, M.; Tecchia, F.; et al. Beaming: An asymmetric telepresence system. IEEE Comput. Graph. Appl. 2012, 32, 10–17. [Google Scholar] [CrossRef] [PubMed]
- Hilty, D.M.; Randhawa, K.; Maheu, M.M.; McKean, A.J.; Pantera, R.; Mishkind, M.C.; Rizzo, A.S. A review of telepresence, virtual reality, and augmented reality applied to clinical care. J. Technol. Behav. Sci. 2020, 5, 178–205. [Google Scholar] [CrossRef]
- Tsui, K.M.; Desai, M.; Yanco, H.A.; Uhlik, C. Exploring use cases for telepresence robots. In HRI ’11, Proceedings of the 6th International Conference on Human-Robot Interaction, Lausanne, Switzerland, 6–9 March 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 11–18. [Google Scholar] [CrossRef]
- Schwarz, M.; Lenz, C.; Rochow, A.; Schreiber, M.; Behnke, S. NimbRo Avatar: Interactive Immersive Telepresence with Force-Feedback Telemanipulation. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 5312–5319. [Google Scholar] [CrossRef]
- Lenz, C.; Behnke, S. Bimanual telemanipulation with force and haptic feedback through an anthropomorphic avatar system. Robot. Auton. Syst. 2023, 161, 104338. [Google Scholar] [CrossRef]
- Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part VI 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 34–50. [Google Scholar]
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Pavlakos, G.; Zhu, L.; Zhou, X.; Daniilidis, K. Learning to Estimate 3D Human Pose and Shape From a Single Color Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.A.; Tzionas, D.; Black, M.J. Expressive Body Capture: 3D Hands, Face, and Body From a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Kocabas, M.; Athanasiou, N.; Black, M.J. VIBE: Video Inference for Human Body Pose and Shape Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Bogo, F.; Kanazawa, A.; Lassner, C.; Gehler, P.; Romero, J.; Black, M.J. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part V 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 561–578. [Google Scholar]
- Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graph. (TOG) 2015, 34, 1–16. [Google Scholar] [CrossRef]
- von Marcard, T.; Rosenhahn, B.; Black, M.; Pons-Moll, G. Sparse Inertial Poser: Automatic 3D Human Pose Estimation from Sparse IMUs. Comput. Graph. Forum 2017, 36, 349–360. [Google Scholar] [CrossRef]
- Huang, Y.; Kaufmann, M.; Aksan, E.; Black, M.J.; Hilliges, O.; Pons-Moll, G. Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time. ACM Trans. Graph. (TOG) 2018, 37, 1–15. [Google Scholar]
- von Marcard, T.; Henschel, R.; Black, M.; Rosenhahn, B.; Pons-Moll, G. Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Computer Vision—ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
- Miseikis, J.; Knobelreiter, P.; Brijacak, I.; Yahyanejad, S.; Glette, K.; Elle, O.J.; Torresen, J. Robot localisation and 3D position estimation using a free-moving camera and cascaded convolutional neural networks. In Proceedings of the 2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Auckland, New Zealand, 9–12 July 2018; pp. 181–187. [Google Scholar]
- Lee, T.E.; Tremblay, J.; To, T.; Cheng, J.; Mosier, T.; Kroemer, O.; Fox, D.; Birchfield, S. Camera-to-robot pose estimation from a single image. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 9426–9432. [Google Scholar]
- Lu, J.; Richter, F.; Yip, M.C. Pose estimation for robot manipulators via keypoint optimization and sim-to-real transfer. IEEE Robot. Autom. Lett. 2022, 7, 4622–4629. [Google Scholar] [CrossRef]
- qb SoftHand Research. Available online: https://qbrobotics.com/product/qb-softhand-research/ (accessed on 25 September 2024).
- Blender. Available online: https://www.blender.org/ (accessed on 25 September 2024).
- Apple Vision Pro. Available online: https://www.apple.com/apple-vision-pro/ (accessed on 25 September 2024).
- XReal Light. Available online: https://www.xreal.com/light/ (accessed on 25 September 2024).
- Xsens MTw Awinda. Available online: https://www.movella.com/products/wearables/xsens-mtw-awinda/ (accessed on 25 September 2024).
- GoPro. Available online: https://gopro.com/ (accessed on 25 September 2024).
- Zoom. Available online: https://zoom.us/ (accessed on 25 September 2024).
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part VIII 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 483–499. [Google Scholar]
- Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. Numerical coordinate regression with convolutional neural networks. arXiv 2018, arXiv:1801.07372. [Google Scholar]
- V7 Darwin. Available online: https://darwin.v7labs.com/ (accessed on 25 September 2024).
- Sumikura, S.; Shibuya, M.; Sakurada, K. OpenVSLAM: A versatile visual SLAM framework. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2292–2295. [Google Scholar]
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. Vitpose: Simple vision transformer baselines for human pose estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584. [Google Scholar]
- Lovanshi, M.; Tiwari, V. Human pose estimation: Benchmarking deep learning-based methods. In Proceedings of the 2022 IEEE Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI), Gwalior, India, 21–23 December 2022; pp. 1–6. [Google Scholar]
- Besl, P.J.; McKay, N.D. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures; Schenker, P.S., Ed.; International Society for Optics and Photonics, SPIE: Bellingham, WA, USA, 1992; Volume 1611, pp. 586–606. [Google Scholar] [CrossRef]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
- Artacho, B.; Savakis, A. Omnipose: A multi-scale framework for multi-person pose estimation. arXiv 2021, arXiv:2103.10180. [Google Scholar]
- Shimada, S.; Golyanik, V.; Xu, W.; Theobalt, C. Physcap: Physically plausible monocular 3d motion capture in real time. ACM Trans. Graph. (ToG) 2020, 39, 1–16. [Google Scholar] [CrossRef]
- Yi, X.; Zhou, Y.; Habermann, M.; Golyanik, V.; Pan, S.; Theobalt, C.; Xu, F. EgoLocate: Real-time motion capture, localization, and mapping with sparse body-mounted sensors. ACM Trans. Graph. (TOG) 2023, 42, 1–17. [Google Scholar] [CrossRef]
- Winkler, A.; Won, J.; Ye, Y. Questsim: Human motion tracking from sparse sensors with simulated avatars. In Proceedings of the SIGGRAPH Asia 2022 Conference Papers, Daegu, Republic of Korea, 6–9 December 2022; pp. 1–8. [Google Scholar]
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).