3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators

Hsiao-Yu Tung, Zhou Xian, Mihir Prabhudesai, Shamit Lal, Katerina Fragkiadaki
Proceedings of the 2020 Conference on Robot Learning, PMLR 155:1669-1683, 2021.

Abstract

We propose an action-conditioned dynamics model that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space, inferred from RGB-D videos. In this 3D feature space, objects do not interfere with one another and their appearance persists over time and across viewpoints. This permits our model to predict scenes far into the future by simply “moving” 3D object features based on cumulative object motion predictions. Object motion predictions are computed by a graph neural network that operates over the object features extracted from the 3D neural scene representation. Our model’s simulations can be decoded by a neural renderer into 2D image views from any desired viewpoint, which aids the interpretability of our latent 3D simulation space. We show that our model generalizes its predictions well across varying numbers and appearances of interacting objects as well as across camera viewpoints, outperforming existing 2D and 3D dynamics models. We further demonstrate sim-to-real transfer of the learned dynamics by applying our model, trained solely in simulation, to model-based control for pushing objects to desired locations under clutter on a real robotic setup.
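
The abstract compresses the whole pipeline into a few sentences; the sketch below (PyTorch) illustrates the rollout idea it describes: a graph network predicts per-object motion from object features and the applied action, and per-object 3D feature volumes are “moved” by the accumulated motion at each step. All module names, tensor shapes, the translation-only motion model, and the cyclic voxel shift are simplifying assumptions made for illustration, not the authors' architecture.

```python
# Minimal sketch of object-factorized rollout in a 3D feature space (illustrative only).
import torch
import torch.nn as nn


class PairwiseMotionGNN(nn.Module):
    """Toy graph network: one round of message passing over object nodes,
    conditioned on the action, predicting a 3D translation per object."""

    def __init__(self, feat_dim=32, action_dim=4):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
        self.node_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim + action_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 3))

    def forward(self, node_feats, action):
        # node_feats: (N, feat_dim) pooled per-object features; action: (action_dim,)
        n = node_feats.shape[0]
        send = node_feats.unsqueeze(1).expand(n, n, -1)
        recv = node_feats.unsqueeze(0).expand(n, n, -1)
        messages = self.edge_mlp(torch.cat([send, recv], dim=-1)).sum(dim=1)
        inp = torch.cat([node_feats, messages, action.expand(n, -1)], dim=-1)
        return self.node_mlp(inp)  # (N, 3) predicted per-object translation


def translate_feature_volume(vol, offset_voxels):
    """Shift an object's 3D feature volume (C, D, H, W) by a whole-voxel offset
    (cyclically, for simplicity); a crude stand-in for the feature 'moving' step."""
    shifts = [int(round(o)) for o in offset_voxels.tolist()]
    return torch.roll(vol, shifts=shifts, dims=(1, 2, 3))


def rollout(object_volumes, actions, gnn):
    """Autoregressive simulation: per-object motion accumulates over time and the
    original (persistent) object features are re-placed according to it each step."""
    cumulative = torch.zeros(len(object_volumes), 3)
    current = list(object_volumes)
    trajectory = []
    for action in actions:
        node_feats = torch.stack([v.mean(dim=(1, 2, 3)) for v in current])
        cumulative = cumulative + gnn(node_feats, action)
        current = [translate_feature_volume(v, off)
                   for v, off in zip(object_volumes, cumulative)]
        trajectory.append(current)  # each step would be composited and decoded by a neural renderer
    return trajectory


if __name__ == "__main__":
    gnn = PairwiseMotionGNN()
    objects = [torch.randn(32, 16, 16, 16) for _ in range(3)]   # three object feature volumes
    actions = [torch.randn(4) for _ in range(5)]                # a 5-step push sequence
    sim = rollout(objects, actions, gnn)
    print(len(sim), sim[0][0].shape)  # 5 steps, each with per-object (32, 16, 16, 16) features
```

Note how the object features themselves are never regenerated, only re-positioned, which matches the abstract's claim that appearance persists over time: over long rollouts only the motion estimates accumulate error.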

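The closing claim about model-based control for pushing can be made concrete with a generic random-shooting MPC loop: sample candidate push sequences, roll each out with the learned simulator, score the predicted object location against the goal, and execute the best first action before replanning. The sketch below uses a 2D point-mass stand-in for the learned model purely to stay self-contained; the planner structure, sample counts, and cost are illustrative assumptions, not the paper's planner.

```python
# Hedged sketch of random-shooting MPC on top of a learned dynamics model.
import torch


def simulate(state, actions):
    """Placeholder dynamics: the object position (2,) moves by each push action (2,)."""
    for a in actions:
        state = state + a
    return state


def plan_push(state, goal, horizon=5, num_samples=256):
    """Random-shooting MPC: return the first action of the best-scoring sequence."""
    candidates = 0.05 * torch.randn(num_samples, horizon, 2)   # candidate push sequences
    costs = torch.stack([
        torch.norm(simulate(state, seq) - goal) for seq in candidates
    ])
    best = torch.argmin(costs)
    return candidates[best, 0]


if __name__ == "__main__":
    state, goal = torch.tensor([0.0, 0.0]), torch.tensor([0.2, -0.1])
    for _ in range(20):                     # replan after every executed push
        action = plan_push(state, goal)
        state = simulate(state, [action])   # in reality: execute on the robot and re-observe
    print(state, torch.norm(state - goal).item())
```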
Cite this Paper


BibTeX
@InProceedings{pmlr-v155-tung21a,
  title     = {3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators},
  author    = {Tung, Hsiao-Yu and Xian, Zhou and Prabhudesai, Mihir and Lal, Shamit and Fragkiadaki, Katerina},
  booktitle = {Proceedings of the 2020 Conference on Robot Learning},
  pages     = {1669--1683},
  year      = {2021},
  editor    = {Kober, Jens and Ramos, Fabio and Tomlin, Claire},
  volume    = {155},
  series    = {Proceedings of Machine Learning Research},
  month     = {16--18 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v155/tung21a/tung21a.pdf},
  url       = {https://proceedings.mlr.press/v155/tung21a.html}
}
Endnote
%0 Conference Paper
%T 3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators
%A Hsiao-Yu Tung
%A Zhou Xian
%A Mihir Prabhudesai
%A Shamit Lal
%A Katerina Fragkiadaki
%B Proceedings of the 2020 Conference on Robot Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Jens Kober
%E Fabio Ramos
%E Claire Tomlin
%F pmlr-v155-tung21a
%I PMLR
%P 1669--1683
%U https://proceedings.mlr.press/v155/tung21a.html
%V 155
APA
Tung, H., Xian, Z., Prabhudesai, M., Lal, S. & Fragkiadaki, K. (2021). 3D-OES: Viewpoint-Invariant Object-Factorized Environment Simulators. Proceedings of the 2020 Conference on Robot Learning, in Proceedings of Machine Learning Research 155:1669-1683. Available from https://proceedings.mlr.press/v155/tung21a.html.
