Diffusing in Someone Else's Shoes: Robotic Perspective Taking with Diffusion

arXiv:2404.07735v1 [cs.RO] 11 Apr 2024

Josua Spisak (1), Matthias Kerzel (2), Stefan Wermter (1)

(1) Josua Spisak and Stefan Wermter are with the Knowledge Technology (WTM) group, Department of Informatics, University of Hamburg, 22527 Hamburg, Germany. josua.spisak@uni-hamburg.de
(2) Matthias Kerzel is with HITeC, 22527 Hamburg, Germany.

The authors gratefully acknowledge support from the DFG (CML, MoReSpace, LeCAREbot), BMWK (VERIKAS), and the European Commission (TRAIL, TERAIS).

Abstract— Humanoid robots can benefit from their similarity to the human shape by learning from humans. When humans teach other humans how to perform actions, they often demonstrate the actions, and the learner can try to imitate the demonstration. The ability to mentally transfer a demonstration seen from a third-person perspective into how it should look from a first-person perspective is fundamental to this kind of learning in humans. As this is a challenging task, it is often simplified for robots by recording the demonstration directly in the first-person perspective. Creating such demonstrations requires more effort but makes imitation easier. We introduce a novel diffusion model that enables the robot to learn directly from third-person demonstrations. Our model learns to generate the first-person perspective from the third-person perspective by translating the sizes and rotations of objects and the environment between the two perspectives. This allows us to combine the benefits of easy-to-produce third-person demonstrations and easy-to-imitate first-person demonstrations. The model can either represent the first-person perspective as an RGB image or compute the corresponding joint values. Our approach significantly outperforms other image-to-image models on this task.

I. INTRODUCTION

Imitation Learning, or Learning from Demonstration (LfD), is a learning mechanism often encountered in humans because of how directly a demonstration can transmit information. If an action is taught by explaining it, a further level of abstraction is needed, which a direct demonstration can bypass. However, it does require other abilities that are found in humans but are more complex than they might seem. When learning from a demonstration, correctly perceiving the demonstration is the first step. Then, this perceived action has to be transferred to ourselves so that we can imitate it. One of the abilities necessary for this process is perspective-taking. Visual and spatial perspective-taking is the ability to see someone and imagine what they are seeing. It requires an understanding of the objects in the other person's field of view and of the fact that they will see those objects differently: each object has to be transferred from the position, shape, and size at which one person sees it to the position, shape, and size at which the other person sees it. In humans, this ability develops between the ages of three and four [1].

Fig. 1. On the left, the figure shows a demonstration in the simulation; on the right, the resulting imitation on the real robot.

In this paper, we show that diffusion models can also learn perspective-taking. This capability allows our approach to directly find the joint configurations necessary to imitate behaviour as seen from a third-person view. It combines the need to perceive the demonstration and to transfer it to our perspective, and even to our body when predicting joint values.
Apart from directly learning from imitation, our approach can also transfer images from the third-person view to the first-person view. This functionality can be used to generate data for the many imitation learning approaches that already exist. "Unfortunately, hitherto imitation learning methods tend to require that demonstrations are supplied in the first-person [...] While powerful, this kind of imitation learning is limited by the relatively hard problem of collecting first-person demonstrations" (Stadie et al. [2]). Our approach is able to generate the first-person view from the third-person view, thereby removing the challenge of collecting first-person view data. We also attempt to directly imitate the demonstrations from the third-person view.

The field of image generation has been improving quickly over the past years through GANs and diffusion models [3], [4], [5], [6]. To generate the first-person view, we make use of diffusion models in particular. Diffusion models work by purposely putting noise on a desired output and then iteratively denoising it again. Through this process, the model learns to generate the desired output from pure noise by detecting general patterns in the desired outputs. Because the denoising process is iterative, the model also learns to build upon its last prediction, so it develops the ability to perceive the given images quite well. To direct the generated output, a condition can be added to the diffusion model. This can be a label or a prompt. However, diffusion has also been used for tasks in which the condition is what would be the input in classical models, such as an image for object detection [7]. We follow this trend of having a high-dimensional and complex condition that should result in a specific generated image. The condition, displaying the third-person image, is given to the model at various stages to allow it to generate a fitting first-person image.

Our main contributions are:
• A novel architecture to generate images in the first-person perspective from images in the third-person perspective for a humanoid robot (we plan to publish our code at https://www.inf.uni-hamburg.de/en/inst/ab/wtm/research/software.html).
• Our model outperforms competing approaches such as pix2pix and CycleGAN.
• Through small modifications to the architecture, the approach can directly infer and translate the first-person pose from the third-person image, as shown in Fig. 1.
• We publish a new dataset with pairs of images for the third-person and first-person perspectives (planned at https://www.inf.uni-hamburg.de/en/inst/ab/wtm/research/corpora.html).

II. RELATED WORK

Isola et al. [8] demonstrated that generative image-to-image models can solve a multitude of tasks, ranging from colourising black-and-white images, to generating an image from its edges alone, to turning segmentation labels into a scene image. They achieved this using conditional Generative Adversarial Networks (GANs). Further advancements in image-to-image models have been made by incorporating special losses such as the cycle consistency loss, which ensures specific outputs for given inputs [9]. Going from the third-person perspective to the first-person perspective can also be done with their model, but the results tend not to be as good as for some of the other tasks. Specialised models have been developed for this specific task of perspective-taking. To focus the model on a general understanding of the demonstration, the kernel sizes of the convolutional layers were increased, and self-attention layers were added to help capture the relations between different parts of the input images.
As a demonstration is often a continuous video, past frames can be used to improve the performance [10]. In a more general approach towards perspective transfer, without the direct relation to robots and imitation learning, Liu et al. proposed the Parallel GAN (P-GAN) architecture [11]. By performing a transfer from first-person to third-person and, at the same time, a transfer from third-person to first-person, they were able to outperform other models on this task, such as X-Seq or X-Fork [12], in which the GANs do not work in parallel.

In base diffusion models, a neural network architecture is used that starts with an image input, encodes it and decodes it again to reconstruct it. The key is that different levels of noise are put on the input image. The model therefore has to learn either to predict which parts of the image are noise, or what the image should look like without the noise [3]. Once trained, the model can iteratively generate images by progressively removing the noise levels encountered during training. As diffusion models serve a similar purpose to GANs in generating images, it does not come as a surprise that they can be used for many similar tasks, including image-to-image translation. Pix2Pix-zero is one such model, which is able to do zero-shot image translation by changing an image along an "edit direction" [9]. Other diffusion approaches focus more on generating images from text prompts, layouts or semantic synthesis [13]. Similar to GANs, diffusion models can also be conditioned to generate a specified output. In diffusion models, there is always one natural form of conditioning, namely the time step: a parameter that tells the network what level of noise to expect so that it can adjust to it. Additional conditions, such as a text embedding to direct the output, can be added to this time step [14]. Such conditioning has also been used for tasks a bit further away from generating images, such as action segmentation [15] or object detection [7]. Here, the output is the bounding box or frame label for the condition input. These works use an encoder-decoder structure instead of a U-Net, and the condition is added to the input of the decoder layers.

When it comes to imitation, movements play a big role. For robots, inverse kinematics solvers tackle the task of moving kinematic body parts to predefined targets, a problem whose complexity can quickly increase with the Degrees of Freedom (DoF) the body part possesses [16]. There can be infinitely many solutions to one inverse kinematics problem, making inverse kinematics very challenging [17]. Deep learning methods, including generative models, have also been adapted to this task, generating the movement necessary to reach the defined position. Similar to image-to-image tasks, the relation between input and output is crucial, making losses like the cycle consistency loss useful for this task [18]. Some approaches have also experimented with further changes to the GAN architecture, changing the input, the costs, the output, or ensembling multiple networks [19].

Learning from demonstration is utilised to get past many challenges of machine learning. From a demonstration, much information about how to solve a task can be gained, and the demonstrated solution can be used to recreate that success.
Imitation learning can be used to allow robots to learn movement planning. As going from RGB images to joints can be difficult, different ways of observing the demonstrations have been used, such as voxels [20] or point clouds [21]. Although pose detectors exist for human data that are able to find key points of the poses directly from RGB images [22], demonstrations are often performed from a first-person perspective so that the trajectories stay similar and the demonstration is closer to the imitation. To directly learn from a third-person perspective, the trajectories have to be translated to the first-person perspective, which can be achieved with the use of generative models [2]. Even just mirroring a pose seen in a mirror can be very challenging for a robot. Hart and Scassellati developed an approach for this challenge that makes use of six models: an end-effector model, a perceptual model, a perspective-taking model, a structural model, an appearance model and a functional model [23]. Their loss was six times higher for demonstrations from the third-person view compared to demonstrations from the first-person view.

III. METHODS

Our approach is based on a conditioned diffusion model. In contrast to the model of Ho et al. [3], we use a full image as our condition, akin to how DiffusionDet [7] or Diffusion Action Segmentation [15] make use of diffusion models. The condition is an image captured from a camera opposite the robot, representing how a person sitting across from it would see the robot. The model then generates the scene as seen by the robot. As the approach uses diffusion, we also have another input, which during inference is simply noise and during training consists of the desired output with noise applied. The level of noise depends on a randomly chosen time step between 0 and 1000 for each sample. A cosine noise schedule calculates the amount of noise depending on the chosen time step [24]. The cosine schedule allows for a smoother transition from no noise to full noise compared to a linear noise schedule. While the network should mostly rely on the condition, the smoother transition can increase the benefits of the iterative denoising process.

The architecture of our model is shown in Fig. 2. The structure largely follows a U-Net [25]; however, there are no latent connections between the primary downward and upward paths of the U-Net. Instead, we added a secondary downward path whose purpose is to encode our condition. This path is connected to the rest of the architecture only through the latent connections. Our model directly predicts the desired output instead of trying to predict the noise, as proposed by Song et al. [4]. This decision was made because our goal lies more in correct predictions and accuracy than in the model's ability to generate novel or varying outputs, as could be desired when generating images from prompts. During inference, we start from randomly generated noise as our input, together with the condition: the RGB image taken from the other perspective, which provides the information needed to replicate the pose seen in it. Predicting the desired result allows us to use only 50 denoising steps, significantly accelerating inference. Our model leverages both the information provided by the condition and the information acquired during later steps of the iterative denoising process to generate its output.
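To make the training signal concrete, the following minimal PyTorch sketch shows the cosine schedule of [24] and one training step in which the model regresses the clean first-person image from its noised version and the third-person condition. The function and variable names are ours, and the stand-in network is a single convolution with channel-wise concatenation rather than the conditioned U-Net of Fig. 2, so this illustrates the objective rather than the released implementation.

    import math
    import torch
    import torch.nn.functional as F

    def cosine_alpha_bar(T: int = 1000, s: float = 0.008) -> torch.Tensor:
        # Cumulative signal level alpha_bar_t for t = 0..T-1, following the cosine schedule of [24].
        t = torch.linspace(0, T, T + 1)
        f = torch.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2
        return (f / f[0])[:-1]                                 # close to 1 at t=0, close to 0 at t=T-1

    def add_noise(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
        # Noise the clean first-person image x0 to the level of time step t.
        ab = alpha_bar[t].view(-1, 1, 1, 1)                    # broadcast over (B, C, H, W)
        return ab.sqrt() * x0 + (1.0 - ab).sqrt() * torch.randn_like(x0)

    # One toy training step with stand-in data and a stand-in network.
    model = torch.nn.Conv2d(6, 3, kernel_size=3, padding=1)
    x0 = torch.rand(4, 3, 64, 64)                              # clean first-person images
    cond = torch.rand(4, 3, 64, 64)                            # third-person condition images

    alpha_bar = cosine_alpha_bar()
    t = torch.randint(0, 1000, (4,))                           # random noise level per sample
    x_t = add_noise(x0, t, alpha_bar)
    pred_x0 = model(torch.cat([x_t, cond], dim=1))             # the model regresses the clean image directly
    loss = F.mse_loss(pred_x0, x0)
    loss.backward()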
Fig. 2. Our architecture has two inputs: the condition, which is the third-person view of the robot (top middle), and the noised output during training or, as seen in the top left image, simply noise during inference. The top right image shows the output of our model, the first-person view from the robot. Both input images are encoded, each through one of the downward paths. The encoded version of the noise is led directly into the upward path that decodes the image into the output, while the encoded version of the condition is led into the upward path through latent connections.

Both paths follow the same structure of convolution layers, attention layers and max-pooling layers. The first path ends in two convolution layers that do not exist in the second downward path. The upward path consists of convolutional layers, self-attention layers and upsampling layers in order to get back to the dimensions of the original input. The latent connections from the second downward path to the upward path always go from the self-attention layers to the upsampling layers. The time step is encoded and fed into the linear layers of the "Down" and "Up" modules.

We can modify our architecture by adding one linear layer that encodes the input from the joint values, two arrays of length 13, to the same dimensions as our conditioning image, an RGB image with a resolution of 64 × 64 pixels. Additionally, one linear decoding layer is appended at the end to convert the dimensions from 64 × 64 × 3 back into 2 × 13 joint values. This allows us to generate joint states instead of an RGB image.

To create our dataset, we made use of the NICOL simulation introduced by Kerzel et al. [26]. We use two cameras in the simulation, one situated in the head of the robot and the second positioned in front of it at a distance that allows us to fully capture the robot. The robot's head is tilted downwards to ensure that the actions are captured in the first-person view. The NICOL robot is a semi-humanoid robot without legs, integrated into a collaborative workspace. We make the robot randomly assume poses with both of its arms. Each of the kinematic chains representing the arms has 13 joints: five of these joints are in the hand, while the other eight are in the arm. The joint values range from −π to π, which we scale to lie between 0 and 1 for training. After a pose has been assumed, we take one image from each camera and record the joint values of the robot. The end-effector position and rotation are also recorded as seven values for each hand, with the first three values representing the x, y, and z coordinates and the other four representing the orientation. During data collection, 10000 samples were recorded. For training, 80% of the dataset is used, while 20% is used for validation. All images are RGB with a height and width of 64 pixels. The random poses are drawn from a uniform distribution but are limited to stay within the visual range and within the joint limits. There are no images in which the arms are behind the robot's back or otherwise obscured. The finger joints are always kept in the same position, so only eight of the thirteen recorded joint values per arm are varied. We have tried out multiple preprocessing steps, such as rotating, flipping, normalising and masking parts of the image.
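A sketch of the two small changes for the joint-value variant, under our own naming assumptions: the joints are rescaled from [−π, π] to [0, 1], and two linear layers map between the 2 × 13 joint vector and the 64 × 64 × 3 image shape that the diffusion backbone expects. The backbone call signature and the wrapper class are hypothetical, not taken from the released code.

    import math
    import torch
    import torch.nn as nn

    def scale_joints(q: torch.Tensor) -> torch.Tensor:
        # Map joint angles from [-pi, pi] to [0, 1], as done for training.
        return (q + math.pi) / (2.0 * math.pi)

    def unscale_joints(q01: torch.Tensor) -> torch.Tensor:
        # Map network outputs back to radians.
        return q01 * 2.0 * math.pi - math.pi

    class JointAdapter(nn.Module):
        # Wraps an image-based diffusion backbone so that it consumes and produces
        # 2 x 13 joint values instead of a 64 x 64 RGB image (hypothetical sketch).
        def __init__(self, backbone, n_joints: int = 2 * 13, img_dim: int = 64 * 64 * 3):
            super().__init__()
            self.backbone = backbone
            self.encode = nn.Linear(n_joints, img_dim)   # joint vector -> pseudo-image
            self.decode = nn.Linear(img_dim, n_joints)   # pseudo-image -> joint vector

        def forward(self, joints: torch.Tensor, cond_img: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
            b = joints.shape[0]
            x = self.encode(joints).view(b, 3, 64, 64)   # reuse the image-shaped backbone unchanged
            y = self.backbone(x, cond_img, t)
            return self.decode(y.reshape(b, -1))

    # Toy usage with a placeholder backbone that simply passes its input through.
    backbone = lambda x, cond, t: x
    adapter = JointAdapter(backbone)
    joints = scale_joints(torch.empty(4, 26).uniform_(-math.pi, math.pi))
    out = adapter(joints, torch.rand(4, 3, 64, 64), torch.randint(0, 1000, (4,)))
    print(out.shape)                                     # torch.Size([4, 26])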
The network was trained for 100 epochs; as mentioned earlier, the noise applied to each training sample is chosen randomly, and there are 1000 different levels of noise possible. This extends the variance of our dataset, as the model is supposed to predict the same result regardless of the level of noise applied to the input. We utilised the Adam optimiser and the MSE loss function. In this end-to-end approach, the network manages to identify the three-dimensional pose of the robot from a two-dimensional RGB image and to transpose that pose from a third-person perspective to an egocentric perspective, which can be expressed as a generated RGB image or as joint states.

Fig. 3. One example of predicting the first-person view, where the direct prediction differs from the prediction after iterative denoising. In the direct prediction, two thumbs can be identified on the hand, whereas only one is left after the iterative denoising, although in the wrong position.

IV. RESULTS

We ran multiple versions of our model to evaluate its strengths and weaknesses. First, we have the image-to-image model that transfers an image from the third-person perspective to the first-person perspective. Both qualitative and quantitative evaluations were performed. We compare the image generation part of our approach to the pix2pix [8] and CycleGAN [9] models. We make use of the implementation provided by the authors of these papers (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/tree/master) and use the standard configurations for both models to train them on our dataset. We evaluate both of these models as well as our own on three metrics with our test set: the mean square error (MSE), the L1 norm (L1) and the structural similarity index measure (SSIM) [27]. For MSE and L1, smaller is better, while the opposite is true for SSIM. Our model outperformed both comparison models across all three metrics, as shown in Tab. I.

TABLE I. Evaluation and comparison of our model, pix2pix and CycleGAN.

    Model      MSE      L1       SSIM
    CycleGAN   0.0221   0.0815   0.6482
    pix2pix    0.0252   0.0692   0.7134
    Ours       0.0007   0.0086   0.9773

Fig. 4. Qualitative comparisons between our model, pix2pix and CycleGAN. Five examples (A, B, C, D, E) illustrate the differences between the models. CycleGAN has almost no variance between its outputs; pix2pix mostly produces the correct poses, except for example C, but is often unable to correctly recreate the hand and finger structure. Our model always produces the correct pose and recreates the hands at a lower resolution.

The qualitative analysis in Fig. 4 shows that CycleGAN seems unable to adjust to the condition, always generating the same pose for the first-person perspective. Pix2pix goes one step further, with the generated images showing an understanding of the translation from third- to first-person. Compared to our results, however, the hands are often impossible to make out, there are still significant differences in the position and size of the arms, and in some cases artefacts appear in the image. Our results only seem to differ in resolution compared to the goal images taken directly from the simulation. The hands and arms are in the correct positions, with even the fingers being clearly visible and discernible. For the iterative denoising, we tried out using 0 steps, 50 steps, 200 steps or the full 1000 steps.
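The sketch below shows how such a sampling loop with a configurable number of denoising steps could be written. The names and placeholder model are ours, and the deterministic re-noising step is one common way of chaining direct clean-image predictions; it is not necessarily the exact update used in our code.

    import torch

    @torch.no_grad()
    def sample_first_person(model, cond, alpha_bar, n_steps: int = 50, img_shape=(3, 64, 64)):
        # Iteratively denoise pure noise into a first-person image, conditioned on the
        # third-person image `cond`. The model predicts the clean image directly; the update
        # below is a deterministic DDIM-style step (a sketch, one common choice).
        T = alpha_bar.shape[0]
        ts = torch.linspace(T - 1, 0, n_steps).long()           # e.g. 50 of the 1000 training steps
        x = torch.randn(cond.shape[0], *img_shape)              # start from pure noise
        for i, t in enumerate(ts):
            x0_hat = model(x, cond, t.expand(cond.shape[0]))    # direct prediction of the clean image
            if i + 1 == len(ts):
                return x0_hat                                   # last step: keep the clean prediction
            ab_t, ab_prev = alpha_bar[t], alpha_bar[ts[i + 1]]
            eps_hat = (x - ab_t.sqrt() * x0_hat) / (1.0 - ab_t).sqrt()
            x = ab_prev.sqrt() * x0_hat + (1.0 - ab_prev).sqrt() * eps_hat  # re-noise to the next level
        return x

    # Toy usage with placeholder pieces so the function can be exercised end to end.
    dummy_model = lambda x, cond, t: torch.zeros_like(x)        # stand-in for the trained network
    dummy_alpha_bar = torch.linspace(0.9999, 1e-4, 1000)        # crude stand-in for the cosine schedule
    img = sample_first_person(dummy_model, torch.rand(2, 3, 64, 64), dummy_alpha_bar, n_steps=50)
    print(img.shape)                                            # torch.Size([2, 3, 64, 64])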
In most of our results, the differences between the goal and the image resulting either from a direct prediction or from a prediction after any number of iterative denoising steps are seemingly impossible to make out, with the prediction from the first step already looking very close to the ground truth. We did, however, come across one image, shown in Fig. 3, in which finding the right position of the thumb was difficult for the model, resulting in two thumbs in the direct prediction, which was reduced to one thumb after the denoising process, although in the wrong position. There are also cases where some of the blur in the goal images captured in the simulation was smoothed out in the predictions.

When it comes to predicting the joints, we again show both quantitative results, through the mean squared error on our validation set, and qualitative results, through images of the poses resulting from the predicted joints. Our average mean squared error over the validation set is 27e-4 for the joints. It appears that the model finds it challenging to generalise from the training set, where we are able to reach a loss as low as 3.3e-6. While the general configuration tends to go in the correct direction, some of the joints are often not completely accurate, shifting the whole arm in a wrong direction. While not as good as the image-generating part, the model still demonstrates an ability to infer the relationship between the condition and the goal joint configuration, as seen in Fig. 5.

Fig. 5. Joint predictions from a third-person image.

We combined the results introduced so far by training our joint prediction model on the first-person images from the demonstration. As the first-person images are supposed to be easier to learn from, and we are capable of generating them, it makes sense to see whether learning the joint values from them is easier. By changing the input images, we managed to reduce the training error to 3e-7 and the validation error to 0.0017, clearly showing the usefulness of the first-person perspective. Augmenting the input data with rotations, flips or masking further improved the validation error to 0.0014, where we reached a limit. One example of the results from using first-person images for the condition is shown in Fig. 6.

Fig. 6. One example of predicting joint values from the first-person view, where the imitation looks similar from the first-person view while looking very different from the third-person view.

It seems to be more difficult to directly predict the joints than to recreate the first-person image, even though both require a spatial understanding of the third-person image. One likely reason is the redundancy in the arm, which allows for many poses that look similar while having very different joint configurations. Fig. 6 shows one particularly interesting result demonstrating this, where the first-person view looks very similar while the actual pose is quite different. With so many very similar-looking poses, the model tends to predict values closer to the mean. To evaluate the predicted joints in more detail, Tab. II shows the mean squared error for each of the joints in the arm; we did not evaluate the five finger joint values. All versions of our model have an error far below the standard deviation of the dataset. The individual errors also show a very close relation to the existing standard deviation.
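The comparison reported in Tab. II can be reproduced with a few lines. This is a sketch with made-up stand-in data and our own naming, intended only to show the per-joint error being set against the per-joint spread of the data.

    import torch

    def per_joint_report(pred: torch.Tensor, target: torch.Tensor):
        # pred, target: (N, 8) tensors of normalised arm-joint values (shapes assumed for illustration).
        mse = ((pred - target) ** 2).mean(dim=0)     # per-joint mean squared error
        std = target.std(dim=0)                      # per-joint spread of the data
        for j in range(mse.shape[0]):
            print(f"j{j + 1}: mse={mse[j].item():.4f}  dataset std={std[j].item():.4f}")

    # Toy stand-in data: predictions that deviate slightly from uniformly sampled joint targets.
    target = torch.rand(2000, 8)
    pred = target + 0.03 * torch.randn_like(target)
    per_joint_report(pred, target)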
Even though different parts of the robot are visible in the first-person view compared to the third-person view, the different models do not show a large discrepancy between the visible and non-visible parts.

TABLE II. The mean squared error for each of the joint values. All values are given in units of 1e-4. The prediction error correlates with the standard deviation of the given joint.

    Left arm        j1    j2    j3    j4    j5    j6    j7    j8
    dataset std    297   762   914   180   436   153   266   294
    third-person    14    79   111     4    28     3    11    13
    first-person    12    71    95     4    24     3    10    10
    augmented       11    67    82     3    20     3     7     9

    Right arm       j1    j2    j3    j4    j5    j6    j7    j8
    dataset std    271   723   860   114   307   119   279   390
    third-person    11    69    96     2    14     2    12    22
    first-person    10    61    85     2    13     2    10    19
    augmented       16    52    78     2    10     2    18    16

V. CONCLUSION

We have shown that we can generate images in the first-person perspective from a third-person perspective that are close to indistinguishable from the real first-person perspective, surpassing established approaches such as pix2pix and CycleGAN. While significant progress has been made, challenges remain in imitation learning from a third-person perspective. We believe that adding a pose detector to our model to obtain a more direct representation of the demonstrated pose could be a fruitful improvement, especially when extending the approach to the real world. Adding more three-dimensional information in this way and making use of models focused on detecting arm poses should make it easier for our model to understand which specific joint configuration is needed. Another way of continuing this work would be to add an existing approach that uses image data from an ego perspective to imitate poses. Throughout this paper, we have demonstrated that our approach effectively learns spatial relations and can resize and rotate objects to adjust them to a new perspective. Our results also highlight the utility of first-person images for learning joint positions, which significantly reduced our loss compared to third-person images. Additionally, we have demonstrated the potential of diffusion models for pose estimation and third-person imitation.

REFERENCES

[1] A. Surtees, I. Apperly, and D. Samson. "The use of embodied self-rotation for visual and spatial perspective-taking." Frontiers in Human Neuroscience, Vol. 7, P. 698, 2013.
[2] B. C. Stadie, P. Abbeel, and I. Sutskever. "Third-Person Imitation Learning." arXiv preprint arXiv:1703.01703, 2017.
[3] J. Ho, A. Jain, and P. Abbeel. "Denoising Diffusion Probabilistic Models." Advances in Neural Information Processing Systems (NIPS), Vol. 33, P. 6840-6851, 2020.
[4] J. Song, C. Meng, and S. Ermon. "Denoising Diffusion Implicit Models." arXiv preprint arXiv:2010.02502, 2020.
[5] B. Zhang, S. Gu, B. Zhang, J. Bao, D. Chen, F. Wen, Y. Wang and B. Guo. "StyleSwin: Transformer-based GAN for High-resolution Image Generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), P. 11304-11314, 2022.
[6] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever and M. Chen. "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models." arXiv preprint arXiv:2112.10741, 2021.
[7] S. Chen, P. Sun, Y. Song, and P. Luo. "DiffusionDet: Diffusion Model for Object Detection." Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), P. 19830-19843, 2023.
[8] P. Isola, J. Zhu, T. Zhou and A. A. Efros. "Image-to-Image Translation with Conditional Adversarial Networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), P. 1125-1134, 2017.
[9] J. Zhu, T. Park, P. Isola and A. A. Efros. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks." Proceedings of the IEEE International Conference on Computer Vision (ICCV), P. 2223-2232, 2017.
[10] L. Garello, F. Rea, N. Noceti and A. Sciutti. "Towards Third-Person Visual Imitation Learning Using Generative Adversarial Networks." IEEE International Conference on Development and Learning (ICDL), P. 121-126, 2022.
[11] G. Liu, H. Latapie, O. Kilic and A. Lawrence. "Parallel Generative Adversarial Network for Third-person to First-person Image Generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), P. 1917-1923, 2022.
[12] K. Regmi and A. Borji. "Cross-View Image Synthesis using Conditional GANs." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), P. 3501-3510, 2018.
[13] R. Rombach, A. Blattmann, D. Lorenz, P. Esser and B. Ommer. "High-Resolution Image Synthesis with Latent Diffusion Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), P. 10684-10695, 2022.
[14] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. G. Lopes, B. K. Ayan, T. Salimans and others. "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding." Advances in Neural Information Processing Systems (NIPS), Vol. 35, P. 36479-36494, 2022.
[15] D. Liu, Q. Li, A. Dinh, T. Jiang, M. Shah and C. Xu. "Diffusion Action Segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), P. 10139-10149, 2023.
[16] A. Aristidou, J. Lasenby, Y. Chrysanthou and A. Shamir. "Inverse Kinematics Techniques in Computer Graphics: A Survey." Computer Graphics Forum, Vol. 37, No. 6, P. 35-58, 2018.
[17] B. Siciliano and O. Khatib, Handbook of Robotics. Springer-Verlag, Vol. 200, 2008.
[18] J. Habekost, E. Strahl, P. Allgeuer, M. Kerzel and S. Wermter. "CycleIK: Neuro-inspired Inverse Kinematics." International Conference on Artificial Neural Networks (ICANN), Springer Nature Switzerland, P. 457-470, 2023.
[19] T. S. Lembono, E. Pignat, J. Jankowski and S. Calinon. "Learning Constrained Distributions of Robot Configurations with Generative Adversarial Network." IEEE Robotics and Automation Letters, Vol. 6.2, P. 4233-4240, 2021.
[20] M. Shridhar, L. Manuelli, and D. Fox. "PERCEIVER-ACTOR: A Multi-Task Transformer for Robotic Manipulation." Conference on Robot Learning (CoRL), PMLR, Vol. 205, P. 785-799, 2023.
[21] S. Song, A. Zeng, J. Lee and T. Funkhouser. "Grasping in the Wild: Learning 6DoF Closed-Loop Grasping from Low-Cost Demonstrations." IEEE Robotics and Automation Letters, Vol. 5.3, P. 4978-4985, 2020.
[22] Y. Xu, J. Zhang, Q. Zhang and D. Tao. "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation." Advances in Neural Information Processing Systems (NIPS), Vol. 35, P. 38571-38584, 2022.
[23] J. Hart and B. Scassellati. "Mirror Perspective-Taking with a Humanoid Robot." Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 26.1, P. 1990-1996, 2012.
[24] A. Nichol and P. Dhariwal. "Improved Denoising Diffusion Probabilistic Models." Proceedings of Machine Learning Research (PMLR), P. 8162-8171, 2021.
[25] O. Ronneberger, P. Fischer, and T. Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." Proceedings of Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer International Publishing, P. 234-241, 2015.
[26] M. Kerzel, P. Allgeuer, E. Strahl, N. Frick, J. Habekost, M. Eppe and S. Wermter. "NICOL: A Neuro-inspired Collaborative Semi-humanoid Robot that Bridges Social Interaction and Reliable Manipulation." IEEE Access, Vol. 11, P. 123531-123542, 2023.
[27] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli. "Image Quality Assessment: From Error Visibility to Structural Similarity." IEEE Transactions on Image Processing, Vol. 13.4, P. 600-612, 2004.