Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Abstract

Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module. We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO. Inspired by prior geometric systems, we allow the networks to see beyond a small temporal window during training, through a novel loss that incorporates temporally distant (e.g., O(100)) frames. Given GPU memory constraints, we propose a stage-wise training mechanism, where the first stage operates in a local time window and the second stage refines the poses with a “global” loss given the first-stage features. We demonstrate competitive results on several standard VO datasets, including KITTI and TUM RGB-D.
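The cycle consistency idea in the abstract — penalizing a forward-backward pose loop that fails to return to its starting frame, analogous to loop closure in geometric VO — can be illustrated with a minimal sketch. This is not the authors' implementation: poses here are 4x4 homogeneous transforms restricted to z-axis rotations for brevity, and the names `pose_z` and `cycle_consistency_loss` are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def pose_z(theta, t):
    """4x4 homogeneous transform: rotation by theta about the z-axis,
    then translation by t = (tx, ty, tz)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c,  -s,  0.0, t[0]],
            [s,   c,  0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def cycle_consistency_loss(T_fwd, T_bwd):
    """Frobenius-norm deviation of the composed pose cycle from identity.
    A forward pose i->j followed by a backward pose j->i should return to
    the starting frame; any residual transform reflects accumulated drift."""
    C = matmul(T_fwd, T_bwd)
    return math.sqrt(sum((C[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(4) for j in range(4)))

# Forward pose and its exact inverse: the cycle closes, so the loss is ~0.
c, s = math.cos(0.1), math.sin(0.1)
T_ij = pose_z(0.1, (1.0, 0.0, 0.2))
T_ji = pose_z(-0.1, (-c, s, -0.2))  # analytic inverse of T_ij
loss_consistent = cycle_consistency_loss(T_ij, T_ji)

# A drifted backward estimate leaves a residual, giving a positive loss.
T_ji_drift = pose_z(-0.08, (-1.0, 0.05, -0.2))
loss_drifted = cycle_consistency_loss(T_ij, T_ji_drift)
```

In the paper the poses are network predictions rather than analytic constructions, and the loss is minimized end-to-end; the sketch only shows the geometric quantity such a loss would penalize.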

Notes

  1. Since accurate pose prediction is the primary focus of this paper, we refer to our method as VO rather than SfM or SLAM.

  2. Table 1 in the supplementary material.

  3. The sequence names are available in the supplementary material.


Acknowledgment

This work was part of Y. Zou’s internship at NEC Labs America in San Jose. Y. Zou and J.-B. Huang were also supported in part by NSF Grant No. 1755785.

Author information

Corresponding author

Correspondence to Yuliang Zou.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 87367 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Zou, Y., Ji, P., Tran, Q.-H., Huang, J.-B., Chandraker, M. (2020). Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12359. Springer, Cham. https://doi.org/10.1007/978-3-030-58568-6_42

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-58568-6_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58567-9

  • Online ISBN: 978-3-030-58568-6

  • eBook Packages: Computer Science, Computer Science (R0)
