Self-Supervised Learning of Depth and Ego-Motion for 3D Perception in Human Computer Interaction

Published: 25 September 2023

Abstract

3D perception of depth and ego-motion is of vital importance in intelligent agent and Human Computer Interaction (HCI) tasks, such as robotics and autonomous driving. Several kinds of sensors can obtain 3D depth information directly, but the commonly used Lidar sensor is expensive, and the effective range of RGB-D cameras is limited. In the field of computer vision, researchers have done substantial work on 3D perception. Whereas traditional geometric algorithms require many hand-crafted features for depth estimation, deep learning methods have achieved great success in this field. In this work, we propose a novel self-supervised method based on a Vision Transformer (ViT) combined with a Convolutional Neural Network (CNN) architecture, referred to as ViT-Depth. The image reconstruction losses computed from the estimated depth and the motion between adjacent frames are treated as the supervision signal, establishing a self-supervised learning pipeline. This is an effective solution for tasks that need accurate and low-cost 3D perception, such as autonomous driving, robotic navigation, and 3D reconstruction. Our method leverages both the ability of CNNs to extract deep features and the ability of Transformers to capture global contextual information. In addition, we propose a cross-frame loss that constrains photometric error and scale consistency across multiple frames, which makes the training process more stable and improves performance. Extensive experimental results on an autonomous driving dataset demonstrate that the proposed approach is competitive with state-of-the-art depth and motion estimation methods.
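The supervision signal described above rests on view synthesis: each pixel of the target frame is back-projected with its estimated depth, moved by the estimated ego-motion, and re-projected into the adjacent frame, where image intensities can be compared. The following is a minimal sketch of that reprojection step only, not the paper's implementation; the function name and the intrinsic matrix are illustrative assumptions.

```python
import numpy as np

def reproject(pixel, depth, K, R, t):
    """Back-project a pixel with its estimated depth into 3D, apply the
    estimated ego-motion (R, t), and project into the adjacent frame.
    The photometric loss compares intensities at pixel and its reprojection."""
    u, v = pixel
    X = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))  # 3D point in target camera frame
    Xp = R @ X + t                                          # transform by relative ego-motion
    uvw = K @ Xp                                            # project into adjacent camera
    return uvw[:2] / uvw[2]

# An illustrative pinhole intrinsic matrix (focal length 500 px, principal point (320, 240)).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# With identity ego-motion the pixel maps onto itself, whatever the depth.
print(reproject((100.0, 80.0), 4.2, K, np.eye(3), np.zeros(3)))
```

In training, this mapping is applied densely to warp the adjacent frame toward the target frame, and the per-pixel reconstruction error supervises both the depth and the pose network without ground-truth labels.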



    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 20, Issue 2
    February 2024
    548 pages
    EISSN:1551-6865
    DOI:10.1145/3613570
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 September 2023
    Online AM: 23 March 2023
    Accepted: 15 March 2023
    Revised: 07 December 2022
    Received: 10 November 2021
    Published in TOMM Volume 20, Issue 2

    Author Tags

    1. autonomous driving
    2. 3D perception
    3. monocular depth and motion estimation
    4. self-supervised learning
    5. visual SLAM

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Shanghai Local Capacity Enhancement
    • Science and Technology Innovation Action Plan
    • Shanghai Science and Technology Commission
    • Chenguang talented program of Shanghai

    Cited By

    • (2024) Monocular Depth and Ego-motion Estimation with Scale Based on Superpixel and Normal Constraints. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 10 (2024), 1–26. DOI: 10.1145/3674977. Online publication date: 1 July 2024.
    • (2024) Self-Supervised Monocular Depth Estimation via Binocular Geometric Correlation Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 8 (2024), 1–19. DOI: 10.1145/3663570. Online publication date: 13 June 2024.
    • (2024) DCL-depth: Monocular depth estimation network based on IAM and depth consistency loss. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-18877-7. Online publication date: 25 March 2024.
    • (2024) Resolution-sensitive self-supervised monocular absolute depth estimation. Applied Intelligence 54, 6 (2024), 4781–4793. DOI: 10.1007/s10489-024-05414-0. Online publication date: 5 April 2024.