Diffusing in Someone Else’s Shoes:
Robotic Perspective Taking with Diffusion
arXiv:2404.07735v1 [cs.RO] 11 Apr 2024
Josua Spisak1 , Matthias Kerzel2 , Stefan Wermter1
Abstract— Humanoid robots can benefit from their similarity
to the human shape by learning from humans. When humans
teach other humans how to perform actions, they often demonstrate the actions and the learning human can try to imitate
the demonstration. Being able to mentally transfer from a
demonstration seen from a third-person perspective to how
it should look from a first-person perspective is fundamental
for this ability in humans. As this is a challenging task, it
is often simplified for robots by creating a demonstration
in the first-person perspective. Creating these demonstrations
requires more effort but allows for an easier imitation. We
introduce a novel diffusion model aimed at enabling the robot
to directly learn from the third-person demonstrations. Our
model is capable of learning and generating the first-person
perspective from the third-person perspective by translating
the size and rotations of objects and the environment between
two perspectives. This allows us to utilise the benefits of easy-to-produce third-person demonstrations and easy-to-imitate first-person demonstrations. The model can either represent the first-person perspective in an RGB image or calculate the joint values. Our approach significantly outperforms other image-to-image models in this task.
I. INTRODUCTION
Imitation Learning or Learning from Demonstrations
(LfD) is a learning mechanism often encountered in
humans because of how directly a demonstration can transmit
information. If an action is taught by explaining it, a further
level of abstraction is needed, which direct demonstration
can bypass. However, it does require other abilities that
can be found in humans but are more complex than they
might seem. When learning from a demonstration, correctly
perceiving the demonstration is the first step. Then, this
perceived action has to be transferred to ourselves so that
we can imitate it. One of the necessary abilities for this
process is perspective-taking.
Visual and spatial perspective-taking is the ability to
see someone and imagine what they are seeing. This
ability requires an understanding of the objects that are in
someone’s field of view and understanding that they will
see the objects differently. The objects have to be transferred from the position, shape, and size at which one person sees them to the position, shape, and size at which another person sees them. In humans, this ability develops between the
*The authors gratefully acknowledge support from the DFG (CML,
MoReSpace, LeCAREbot), BMWK (VERIKAS), and the European Commission (TRAIL, TERAIS).
1 Josua Spisak and Stefan Wermter are with the Knowledge Technology
(WTM) group, Department of Informatics, University of Hamburg, 22527
Hamburg, Germany josua.spisak@uni-hamburg.de
2 Matthias Kerzel is with HITeC, 22527 Hamburg, Germany
Fig. 1. The figure shows a demonstration in the simulation on the left and the resulting imitation on the real robot on the right.
ages 3-4 [1]. In this paper, we show that diffusion models
can also learn perspective-taking. This capability allows our
approach to directly find the joint configurations necessary
to imitate behaviour as seen from a third-person view. It
combines the need to perceive the demonstration and to
transfer it to our perspective and even body when predicting
joint values.
Apart from directly learning from imitation, our approach
can also transfer images from the third-person view to the
first-person view. This functionality can be used to generate data for the many imitation learning approaches that already exist. "Unfortunately, hitherto imitation learning methods tend to require that demonstrations are supplied in the first-person [...] While powerful, this kind of imitation learning is limited by the relatively hard problem of collecting first-person demonstrations" (Stadie et al. [2]). Our
approach is able to generate the first-person view from
the third-person view, thereby removing the challenge of
collecting first-person view data. We also attempt to directly
imitate the demonstrations from the third-person view.
The field of image generation has been quickly improving
over the past years through GANs and diffusion models [3],
[4], [5], [6]. To generate the first-person view, we make use of
the diffusion models in particular. Diffusion models work by purposely putting noise on a desired output and then iteratively denoising it again. Through this process, the model learns
to generate the desired output from pure noise by detecting
general patterns in the desired outputs. The model also learns
to build upon its last prediction as the denoising process
is iterative, so it also develops the ability to perceive the
given images quite well. To direct the generated output, a
condition can be added to the diffusion model. This can be
a label or a prompt. However, diffusion has also been used
for tasks in which the condition is what would be the input
in classical models, such as an image for object detection
[7]. We follow this trend of having a high-dimensional and
complex condition that should result in a specific generated
image. The condition displaying the third-person image is
given to the model in various stages to allow it to generate
a fitting first-person image.
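The forward-noising step that such a conditioned model learns to invert can be written in a few lines. The following is a minimal NumPy sketch; the linear schedule and array names are illustrative assumptions (the paper itself uses a cosine schedule, described in Sec. III):

```python
import numpy as np

# DDPM forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)          # cumulative signal retention

def forward_noise(x0, t, rng):
    """Noise a clean target image x0 to time step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.random((64, 64, 3))                 # stands in for a first-person image
xt_early, _ = forward_noise(x0, 10, rng)     # barely noised
xt_late, _ = forward_noise(x0, 999, rng)     # almost pure noise
```

During training the model sees such noised targets together with the third-person condition image; during inference it starts from pure noise and relies on the condition alone.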
Our main contributions are:
• A novel architecture to generate images in first-person
perspective from images in third-person perspective for
a humanoid robot.1
• Our model outperforms competing approaches such as
pix2pix and CycleGAN.
• The approach can directly infer and translate the first-person pose from the third-person image, as shown in Fig. 1, through small modifications to the architecture.
• We publish a new dataset with pairs of images for third-person and first-person perspectives.2
II. RELATED WORK
Isola et al. [8] demonstrated that generative image-to-image models can solve a multitude of tasks, ranging from changing an image from black and white to coloured, to generating an image just from the edges, to turning segmentation labels into a scene image. They achieved
this using conditional Generative Adversarial Networks
(GANs). Further advancements in image-to-image models
have been made by incorporating special losses such as
cycle consistency loss, which ensures specific outputs for
given inputs [9]. Going from the third-person perspective to the first-person perspective can also be done using their model, but the results tend to be not as good as for some of the other tasks.
Specialised models have been developed for this specific
task of perspective-taking. To focus the model on a general
understanding of the demonstration, the kernel sizes of the
convolutional layers were increased, and self-attentive layers
were added to help with understanding the relations between
different parts of the input images. As a demonstration
is often a continuous video, past frames can be used to
improve the performance [10]. In a more general approach
towards perspective transfer without the direct relation
to robots and imitation learning, Liu et al. proposed the
Parallel GAN (P-GAN) architecture [11]. By doing both a
transfer from first-person to third-person and, at the same
time, a transfer from third-person to first-person, they were
able to outperform other models in this task, such as X-Seq
or X-Fork [12] where the GANs were not working in parallel.
In base diffusion models, a neural network architecture
is used that starts with an image input, encodes it and
1 We plan to publish our code at https://www.inf.uni-hamburg.de/en/inst/ab/wtm/research/software.html
2 We plan to publish our dataset at https://www.inf.uni-hamburg.de/en/inst/ab/wtm/research/corpora.html
decodes it again to reconstruct it. The key is that different
levels of noise are put on the input image. The model,
therefore, has to learn to either predict which parts of
the image are noise, or what it should look like without
the noise [3]. Once trained, the model can iteratively
generate images by progressively removing noise levels
encountered during training. As diffusion models possess a
similar purpose to GANs in generating images, it does not
come as a surprise that they can be used for many similar
tasks, including image-to-image translations. Pix2Pix-zero
is one such model, which is able to do zero-shot image translation by changing an image along an "edit direction" [9]. Other diffusion approaches focus more on generating
images from text prompts, layouts or semantic synthesis [13].
Similar to GANs, diffusion models can also be conditioned
to generate specified output. In diffusion models, there is always one natural form of conditioning, which is the time step. This is a parameter that tells the network what level of noise it can expect so that it can adjust to that.
Additional conditions, such as a text embedding to direct
the output, can be added to this time step [14]. The
conditioning has also been used for tasks a bit further
away from generating images, such as action segmentation
[15] or object detection [7]. Here, the output is the
bounding box or frame label for the condition input. These
works use an encoder-decoder structure instead of a U-Net,
and the condition is added to the input of the decoder layers.
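The time-step conditioning mentioned above is commonly realised as a sinusoidal embedding, to which further condition embeddings can simply be added before the vector reaches the network blocks. A minimal sketch (function name, dimensions, and the zero placeholder condition are illustrative assumptions):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a diffusion time step (DDPM-style)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(250, 128)
# An extra condition embedding (e.g. text or image features) is added
# element-wise to the time-step embedding; zeros stand in here.
cond = np.zeros(128)
conditioned = emb + cond
```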
When it comes to imitation, movements play a big
role. For robots, Inverse Kinematics solvers tackle the task
of moving kinematic body parts to predefined targets, a
problem that can quickly increase in complexity with the
Degrees of Freedom (DoF) the body part possesses [16].
There can be infinite solutions for one inverse kinematic
problem, making inverse kinematics very challenging [17].
Deep learning methods, including generative models, have
also been adapted to this task, generating the movement
necessary to reach the defined position. Similar to image-to-image tasks, the relation between the input and output is
crucial, making losses like the cycle consistency loss useful
for this task [18]. Some approaches have also experimented
with further changes to the GAN architecture, changing the
input, the costs, the output or ensembling multiple networks
[19].
Learning from demonstration is utilised to get past many
challenges of machine learning. From a demonstration,
much information about how to solve a task can be gained,
and the demonstrated solution can be used to recreate that
success. Imitation learning can be used to allow robots
to learn movement planning. As going from RGB images to joints can be difficult, different kinds of observations of the demonstrations have been used, such as voxels [20] or point clouds [21]. Although pose detectors exist for human data that are able to find key points of the poses directly from RGB images [22], demonstrations are often performed from a first-person perspective so that trajectories stay similar and the demonstration is closer to the imitation. To directly
learn from a third-person perspective, the trajectories have
to be translated to the first-person perspective, which
can be achieved with the use of generative models [2].
Even just mirroring a pose seen in a mirror can be very
challenging for a robot. Hart and Scassellati developed an
approach that makes use of six models, an end-effector
model, a perceptual model, a perspective-taking model, a
structural model, an appearance model and a functional
model for this challenge. Their loss was six times higher for demonstrations from the third-person view compared to demonstrations from the first-person view [23].
III. METHODS
Our approach is based on a conditioned diffusion model.
In contrast to the model from Ho et al. [3], we use a full
image as our condition, akin to how DiffusionDet [7] or
Diffusion Action Segmentation [15] made use of diffusion
models. The condition is an image captured from a camera
opposite the robot, representing how a person sitting across
from it would see the robot. The model then generates the
scene as seen by the robot. As the approach uses diffusion,
we also have another input, which during inference is
simply noise and during training consists of the desired
output with noise applied. The level of noise depends on a
randomly chosen time step between 0 and 1000 for each
sample. A cosine noise schedule calculates the amount of
noise depending on the chosen time step [24]. The cosine
schedule allows for a smoother transition from no noise to
full noise compared to a linear noise schedule. While the
network should mostly rely on the condition, the smoother
transition can increase the benefits of the iterative denoising
process.
The architecture of our model is shown in Fig. 2.
The structure of the model largely follows a U-Net [25],
however, there are no latent connections between the
primary downward and upward leading paths of the U-Net.
Instead, we added a secondary downward leading path that
has the purpose of encoding our condition. This path is only
connected through the latent connections to the rest of the
architecture. Our model directly predicts the desired output,
instead of trying to predict the noise, as proposed by Song
et al. [4]. This decision was made as our goal lies more in
correct predictions and accuracy than in the model’s ability
to generate new ideas or varying outputs, as could be desired
when generating images from prompts. During inference, we
start with randomly generated noise for our input and the condition, which is the RGB image taken from a different perspective, providing the information needed to replicate the pose seen in it. Predicting the desired result allows us
to utilise only 50 denoising steps, significantly accelerating
the inference. Our model leverages both the information
provided by the condition and that acquired during later
Fig. 2. Our architecture has two inputs: the condition, which is the third-person view of the robot shown in the top middle, and the noised output during training or, as seen in the top left image, simply noise during inference. The top right image shows the output of our model, the first-person view from the robot. Both inputs are encoded, each through one of the downward paths. The encoded version of the noise is led directly into the upward path that decodes it to the output, while the encoded version of the condition is led into the upward path through latent connections.
steps of the iterative denoising process to generate its output.
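Because the model predicts the clean target directly, the 50-step sampler can re-noise its own prediction to the next time step and repeat. The following is a sketch of such a deterministic (DDIM-style, η = 0) loop under an assumed `model(x_t, t, cond)` interface; the oracle model at the end only demonstrates that the loop runs:

```python
import numpy as np

def sample_x0_prediction(model, cond, shape, alpha_bar, n_steps=50, rng=None):
    """Sampler for a model that predicts the clean image x0 directly."""
    rng = rng or np.random.default_rng(0)
    T = len(alpha_bar)
    steps = np.linspace(T - 1, 0, n_steps).astype(int)
    x = rng.standard_normal(shape)                 # start from pure noise
    for i, t in enumerate(steps):
        x0_hat = model(x, t, cond)                 # direct x0 prediction
        if i + 1 < len(steps):
            t_next = steps[i + 1]
            # Implied noise at step t, reused deterministically.
            eps = (x - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1 - alpha_bar[t])
            x = np.sqrt(alpha_bar[t_next]) * x0_hat + np.sqrt(1 - alpha_bar[t_next]) * eps
        else:
            x = x0_hat
    return x

# Toy "oracle" that always returns the condition, just to exercise the loop.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
target = np.full((64, 64, 3), 0.5)
out = sample_x0_prediction(lambda x, t, c: c, target, target.shape, alpha_bar)
```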
Both paths follow the same structure of convolution
layers, attention layers and maxpooling layers. The first
path ends in two convolution layers that do not exist in
the second downward path. The upward path consists of
convolutional layers, self-attention layers and upsampling
layers in order to get back to the dimensions of the original
input. The latent connections from the second downward
path to the upward path always go from the self-attention
layer to the upsampling layers. The time step is encoded and given to the linear layers of the "Down" and "Up" modules.
We can modify our architecture by adding one linear
layer in order to encode the input from the form of the
joint values, which are two arrays of length 13, to the same dimensions as our conditioning image, which is an RGB image with a resolution of 64 × 64 pixels. Additionally, one
linear decoding layer is appended at the end, to convert the
dimensions from 64 × 64 × 3 into 2 × 13 joint values. This
allows us to generate joint states instead of an RGB image.
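The shape bookkeeping of these two added linear layers can be sketched as follows; untrained random matrices stand in for the learned layers, so only the dimensions are meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)
JOINTS, IMG = 2 * 13, 64 * 64 * 3      # two 13-joint arms; 64x64 RGB image

# Stand-ins for the single encoding and decoding linear layers.
W_enc = rng.standard_normal((IMG, JOINTS)) * 0.01
W_dec = rng.standard_normal((JOINTS, IMG)) * 0.01

joints = rng.random(JOINTS)                        # scaled joint values in [0, 1]
as_image = (W_enc @ joints).reshape(64, 64, 3)     # joints -> image-shaped tensor
decoded = W_dec @ as_image.reshape(-1)             # image-shaped tensor -> joints
```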
To create our dataset, we made use of the NICOL
simulation introduced by Kerzel et al. [26]. We use two
cameras in the simulation, one situated in the head of
the robot and the second one positioned in front of it at
a distance that allows us to fully capture the robot. The
robot’s head is tilted downwards to ensure the actions
are captured in the first-person view. The NICOL robot
is a semi-humanoid robot without legs integrated into a
collaborative workspace. We make the robot randomly
assume poses for both of its arms. In each of the kinematic
chains representing the arms, we have 13 joints. Five of
these joints are in the hands, while the other eight are
in the arms. The joint values range from −π to π, which we scale to be between 0 and 1 for the training.
After the poses have been taken, we take one image from
both cameras and record the joint values from the robot.
The end-effector position and rotation are also recorded
in seven values for each of the hands, with the first three
values representing the x, y, and z coordinates and the other
four values representing the orientation. During the data
collection, 10000 samples were recorded. For training, 80%
of the dataset is used, while 20% is used for validation. All the images are RGB and have a height and a width of 64 pixels. The random poses come from a uniform distribution
but are limited to stay in the visual range and within the
ranges of the joints. There are no images where the arms are
behind the robot’s back or otherwise obscured. The finger
joints are always kept in the same position, so only eight of
the thirteen recorded joint values per arm are changed. We
have tried out multiple preprocessing steps, such as rotating,
flipping, normalising and masking parts of the image.
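The joint-value scaling described above is a simple affine map and its inverse:

```python
import numpy as np

def scale_joints(q):
    """Map joint values from [-pi, pi] to [0, 1] for training."""
    return (q + np.pi) / (2 * np.pi)

def unscale_joints(q01):
    """Map network outputs in [0, 1] back to joint values in [-pi, pi]."""
    return q01 * 2 * np.pi - np.pi

q = np.array([-np.pi, 0.0, np.pi])
```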
The network was trained for 100 epochs; as mentioned
earlier, the noise applied to each training sample is chosen
randomly, and there are 1000 different levels of noise
possible. This extends the variance of our dataset, as the
model is supposed to predict the same result regardless of
the level of noise that was applied to the input. We utilised
the Adam optimiser and the MSE loss function.
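A single training step as described above samples one of the 1000 noise levels per example, noises the target, and applies the MSE loss between the model's direct prediction and the clean target. In this sketch, `model` is an assumed stand-in for the real network (a zero-output lambda is used below just to exercise the step), and the linear schedule is illustrative:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

def training_step(model, cond, target, rng):
    t = rng.integers(0, T)                          # random noise level per sample
    eps = rng.standard_normal(target.shape)
    noisy = np.sqrt(alpha_bar[t]) * target + np.sqrt(1 - alpha_bar[t]) * eps
    pred = model(noisy, t, cond)                    # model predicts x0 directly
    return np.mean((pred - target) ** 2)            # MSE loss, as in the paper

target = rng.random((64, 64, 3))
loss = training_step(lambda x, t, c: np.zeros_like(x), None, target, rng)
```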
Fig. 3. One example of predicting the first-person view, where the direct
prediction differs from the prediction after iterative denoising. In the direct
prediction, two thumbs can be identified on the hand, whereas only one is
left after the iterative denoising, although in the wrong position.
In this end-to-end approach, the network manages to identify the three-dimensional pose of the robot from a two-dimensional RGB image and to transpose that pose from a third-person perspective to an egocentric perspective, which
can be expressed through a generated RGB image or joint
states.
IV. RESULTS
We ran multiple versions of our model to evaluate its strengths and weaknesses. First, we have the image-to-image model that transfers an image from the third-person perspective to the first-person perspective. Both qualitative
and quantitative evaluations were performed.
We compare the image generation part of our approach
to the pix2pix [8] model and the CycleGAN [9] model. We
make use of the implementation provided by the authors of
these papers3 and use the standard configurations for both
models to train them on our dataset. We evaluate both of
these models as well as our own on three metrics with our
test set. The metrics used are mean square error (MSE),
L1 norm (L1) and the structural similarity index measure
(SSIM) [27]. For MSE and L1, smaller is better, while the
opposite is true for the SSIM. Our model outperformed
both comparison models across all three metrics, as shown
in Tab. I.
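The three metrics can be computed per image pair as below. MSE and L1 are straightforward; the SSIM here is a simplified single-window variant over whole-image statistics (the original measure in [27] averages over local windows, so this sketch is an approximation):

```python
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def l1(a, b):
    return np.mean(np.abs(a - b))

def ssim_global(a, b, data_range=1.0):
    """Single-window SSIM over whole-image statistics (simplification of [27])."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

Identical images give MSE = L1 = 0 and SSIM = 1; independent noise images score near zero SSIM.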
Qualitative analysis, Fig. 4, shows that CycleGAN seems to be unable to correctly adjust to the condition, always
3 https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/tree/master
Fig. 4. Qualitative comparisons between our model and pix2pix as well as CycleGAN. Five examples (A, B, C, D, E) are shown to illustrate the differences between the models. CycleGAN has almost no variance between its outputs; pix2pix mostly seems to have the correct poses, except for example C, but is often unable to correctly recreate the hands and finger structure. Our model always has the correct pose and recreates the hands at a lower resolution.
TABLE I
EVALUATION AND COMPARISON OF OUR MODEL, PIX2PIX AND CYCLEGAN.

Model    | MSE    | L1     | SSIM
CycleGAN | 0.0221 | 0.0815 | 0.6482
pix2pix  | 0.0252 | 0.0692 | 0.7134
Ours     | 0.0007 | 0.0086 | 0.9773
generating the same pose for the first-person perspective.
Pix2pix goes one step further with the generated image
showing an understanding of the translation from third- to
first-person. Compared to our result, however, the hands
are often impossible to make out; there are still significant
differences in the position and size of the arms and in some
cases, artefacts appear in the image. Our results only seem
to differ in the resolution compared to the goal images
taken directly from the simulation. The hands and arms are
in the correct positions, with even the fingers being clearly
visible and discernible.
For the iterative denoising, we tried out doing 0 steps,
50 steps, 200 steps or the full 1000 steps. In most of our
results, the differences between the goal and the image
resulting either from a direct prediction or from a prediction
Fig. 5. Joint predictions from the third-person image.
after going through any number of iterative denoising steps
are seemingly impossible to make out, with the prediction
from the first step already looking very close to the ground
truth. We did, however, come across one image, as seen
in Fig. 3, in which finding the right position of the thumb
was difficult for the model, resulting in two thumbs for
the direct prediction, which was reduced to one thumb
after the denoising process although in the wrong position.
There are also cases where some of the blur from the goal
images captured in the simulation was smoothed out in the
predictions.
When it comes to predicting the joints, we have again
shown both quantitative, through the mean square error
on our validation set, and qualitative results, by showing
images of the poses resulting from the predicted joints.
Our average mean squared error over the validation set
is 27e-4 for the joints. It appears that the model finds it
challenging to generalise from the training set, where we
are able to get to a loss as low as 3.3e-6. While the general
configuration seems to go in the correct direction, often,
some of the joints are not completely accurate, shifting
the whole arm in a wrong direction. While not as good as
the image-generating part, the model still demonstrated an
ability to infer a relationship between the condition and the
goal joint configuration as seen in Fig. 5.
We combined the results introduced so far, by training
our joint position model on the first-person images from the
demonstration. As the first-person images are supposed to be
easier to learn from, and we are capable of generating them,
it makes sense to see if learning the joint values from them is
Fig. 6. One example of predicting joint values from the first-person view,
where the imitation looks similar from the first-person view while looking
very different from the third-person view.
easier. By changing the input images, we managed to reduce the training error to 3e-7 and the validation error to 0.0017, clearly showing the benefit of the first-person perspective.
Augmenting the input data with rotations, flips or maskings
was able to further improve the validation error to 0.0014
where we reached a limit. One example of the results from
using first-person images for the condition is shown in Fig. 6.
It seems to be more difficult to directly predict the joints
instead of recreating the first-person image, despite both
requiring a spatial understanding through the third-person
image. One possible reason for this is the redundancy in the arm, which allows for many poses that look similar while having very different joint configurations. Fig. 6 shows one particularly interesting result demonstrating this, where the
first-person view looks very similar, while the actual pose
looks quite different. With so many very similar looking
poses, the model tends to predict values closer to the mean.
To evaluate the predicted joints in more detail, Tab. II
shows the mean square error for each of the joints in the
arm; we did not evaluate the five finger joint values. All
versions of our model have an error far below the standard
deviation of the dataset. The individual errors also show a
very close relation to the existing standard deviation. Even though different parts of the robot are visible in the first-person view compared to the third-person view, the different models do not show a large discrepancy between the visible and non-visible parts.
TABLE II
THE MEAN SQUARED ERROR FOR EACH OF THE JOINT VALUES. ALL THE VALUES ARE E-4. THE PREDICTION ERROR CORRELATES WITH THE STANDARD DEVIATION OF THE GIVEN JOINT.

Left arm     | j1  | j2  | j3  | j4  | j5  | j6  | j7  | j8
dataset std  | 297 | 762 | 914 | 180 | 436 | 153 | 266 | 294
third-person | 14  | 79  | 111 | 4   | 28  | 3   | 11  | 13
first-person | 12  | 71  | 95  | 4   | 24  | 3   | 10  | 10
augmented    | 11  | 67  | 82  | 3   | 20  | 3   | 7   | 9

Right arm    | j1  | j2  | j3  | j4  | j5  | j6  | j7  | j8
dataset std  | 271 | 723 | 860 | 114 | 307 | 119 | 279 | 390
third-person | 11  | 69  | 96  | 2   | 14  | 2   | 12  | 22
first-person | 10  | 61  | 85  | 2   | 13  | 2   | 10  | 19
augmented    | 16  | 52  | 78  | 2   | 10  | 2   | 18  | 16
V. CONCLUSION
We have shown that we can generate images in the first-person perspective from a third-person perspective that are close to indistinguishable from the real first-person perspective, surpassing established approaches such as pix2pix or CycleGAN. While significant progress has been made, challenges remain in imitation learning from a third-person perspective. We believe that adding a pose detector
to our model to get a more direct representation of the
demonstrated pose can be a fruitful improvement, especially
when extending the approach to the real world. Adding
more three-dimensional information like this and making
use of models focused on detecting arm poses should
make it easier for our model to understand which specific
joint configuration is needed. Another way of continuing
this work would be to add an already existing approach
that uses image data from an ego perspective to imitate poses.
Throughout this paper, we have demonstrated that our
approach effectively learns spatial relations and can resize
and rotate objects to adjust them to a new perspective.
Our results also highlight the utility of first-person images
in learning joint positions, which significantly reduced our
loss compared to third-person images. Additionally, we have
demonstrated the potential of diffusion models for pose
estimation and third-person imitation.
REFERENCES
[1] A. Surtees, I. Apperly, and D. Samson. ”The use of embodied self-rotation for visual and spatial perspective-taking.” Frontiers in Human Neuroscience, Vol. 7, P. 698, 2013.
[2] B. C. Stadie, P. Abbeel, and I. Sutskever. ”Third-Person Imitation
Learning.” arXiv preprint arXiv:1703.01703 2017.
[3] J. Ho, A. Jain, and P. Abbeel. ”Denoising Diffusion Probabilistic
Models.” Advances in Neural Information Processing Systems (NIPS)
Vol. 33, P. 6840-6851, 2020
[4] J. Song, C. Meng, and S. Ermon. ”Denoising diffusion implicit
models.” arXiv preprint arXiv:2010.02502 (2020).
[5] B. Zhang, S. Gu, B. Zhang, J. Bao, D. Chen, F. Wen, Y. Wang and B. Guo. ”StyleSwin: Transformer-based GAN for High-resolution Image
Generation.” Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). P. 11304-11314 2022.
[6] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever and M. Chen. ”GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.” arXiv preprint arXiv:2112.10741, 2021.
[7] S. Chen, P. Sun, Y. Song, and P. Luo. ”DiffusionDet: Diffusion Model
for Object Detection.” Proceedings of the IEEE/CVF International
Conference on Computer Vision (ICCV). P. 19830-19843, 2023.
[8] P. Isola, J. Zhu, T. Zhou and A. A. Efros. ”Image-to-Image Translation with Conditional Adversarial Networks.” Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition
(CVPR). P.1125-1134, 2017.
[9] J. Zhu, T. Park, P. Isola and A. A. Efros. ”Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks.”
Proceedings of the IEEE International Conference on Computer Vision
(ICCV). P. 2223-2232, 2017.
[10] L. Garello, F. Rea, N. Noceti and A. Sciutti. ”Towards Third-Person
Visual Imitation Learning Using Generative Adversarial Networks.”
IEEE International Conference on Development and Learning (ICDL)
P.121-126, 2022.
[11] G. Liu, H. Latapie, O. Kilic and A. Lawrence. ”Parallel Generative
Adversarial Network for Third-person to First-person Image Generation.” Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops (CVPRW). P.1917-1923, 2022.
[12] K. Regmi and A. Borji. ”Cross-View Image Synthesis using Conditional GANs.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), P. 3501-3510, 2018.
[13] R. Rombach, A. Blattmann, D. Lorenz, P. Esser and B. Ommer.
”High-Resolution Image Synthesis with Latent Diffusion Models.”
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). P.10684-10695, 2022.
[14] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton,
K. Ghasemipour, R. G. Lopes, B. K. Ayan, Z. Salimans and others.
”Photorealistic Text-to-Image Diffusion Models with Deep Language
Understanding.” Advances in Neural Information Processing Systems
(NIPS) Vol. 35, P. 36479-36494, 2022
[15] D. Liu, Q. Li, A. Dinh, T. Jiang, M. Shah and C. Xu ”Diffusion
Action Segmentation.” Proceedings of the IEEE/CVF International
Conference on Computer Vision (ICCV). P.10139-10149, 2023.
[16] A. Aristidou, J. Lasenby, Y. Chrysanthou and A. Shamir. ”Inverse
Kinematics Techniques in Computer Graphics: A Survey.” Computer
graphics forum, Vol. 37, No. 6, P. 35-58, 2018.
[17] B. Siciliano and O. Khatib, Handbook of Robotics. Springer-Verlag
Vol. 200, 2008.
[18] J. Habekost, E. Strahl, P. Allgeuer, M. Kerzel and S. Wermter. ”CycleIK: Neuro-inspired Inverse Kinematics.” International Conference
on Artificial Neural Networks (ICANN). Springer Nature Switzerland,
P.457-470, 2023.
[19] T. S. Lembono, E. Pignat, J. Jankowski and S. Calinon. ”Learning
Constrained Distributions of Robot Configurations with Generative
Adversarial Network.” IEEE Robotics and Automation Letters Vol.
6.2 P. 4233-4240, 2021
[20] M. Shridhar, L. Manuelli, and D. Fox. ”PERCEIVER-ACTOR: A Multi-Task Transformer for Robotic Manipulation.” Conference on Robot Learning (CoRL). PMLR, Vol. 205, P. 785-799, 2023.
[21] S. Song, A. Zeng, J. Lee and T. Funkhouser. ”Grasping in the Wild:
Learning 6DoF Closed-Loop Grasping from Low-Cost Demonstrations.” IEEE Robotics and Automation Letters, Vol. 5.3, P. 4978-4985,
2020
[22] Y. Xu, J. Zhang, Q. Zhang and D. Tao. ”ViTPose: Simple Vision
Transformer Baselines for Human Pose Estimation.” Advances in Neural Information Processing Systems (NIPS), Vol. 35, P. 38571-38584, 2022.
[23] J. Hart, and B. Scassellati. ”Mirror Perspective-Taking with a Humanoid Robot.” Proceedings of the AAAI Conference on Artificial
Intelligence. Vol. 26.1, P.1990-1996, 2012.
[24] A. Nichol and P. Dhariwal. ”Improved Denoising Diffusion Probabilistic Models.” Proceedings of Machine Learning Research (PMLR). P.
8162-8171, 2021.
[25] O. Ronneberger, P. Fischer, and T. Brox. ”U-Net: Convolutional Networks
for Biomedical Image Segmentation.” Proceedings of Medical Image
Computing and Computer-Assisted Intervention (MICCAI), Springer
International Publishing, P.234-241, 2015.
[26] M. Kerzel, P. Allgeuer, E. Strahl, N. Frick, J. Habekost, M. Eppe and
S. Wermter. ”NICOL: A Neuro-inspired Collaborative Semi-humanoid
Robot that Bridges Social Interaction and Reliable Manipulation.”
IEEE access, Vol. 11, P. 123531-123542, 2023.
[27] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli. ”Image Quality
Assessment: From Error Visibility to Structural Similarity.” IEEE
Transactions on Image Processing Vol. 13.4, P. 600-612 2004