Abstract
Existing unsupervised visual odometry (VO) methods either match pairwise images or integrate temporal information with recurrent neural networks over long image sequences. As a result, they tend to be inaccurate, time-consuming to train, or prone to error accumulation. In this paper, we propose a method consisting of two camera pose estimators that handle information from pairwise images and from short image sequences, respectively. For image sequences, a transformer-like structure is adopted to build a geometry model over a local temporal window, referred to as the transformer-based auxiliary pose estimator (TAPE). Meanwhile, a flow-to-flow pose estimator (F2FPE) is proposed to exploit the relationship between pairwise images. The two estimators are constrained by a simple yet effective consistency loss during training. Empirical evaluation shows that the proposed method outperforms state-of-the-art unsupervised learning-based methods by a large margin and performs comparably to supervised and traditional methods on the KITTI and Malaga datasets.
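To make the idea concrete, the following is a minimal PyTorch sketch of the two-branch design described above: a transformer encoder aggregates features over a short local window (standing in for TAPE), a small regressor maps a pairwise flow feature to a pose (standing in for F2FPE), and an L1 term ties the two predictions together. All module names, internals, dimensions, and the L1 form of the consistency loss are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoBranchVO(nn.Module):
    """Hypothetical sketch of the two-estimator idea from the abstract."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # TAPE-like branch: transformer encoder over per-frame features
        # from a short local window, regressing a 6-DoF pose.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8)
        self.tape_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.tape_head = nn.Linear(feat_dim, 6)
        # F2FPE-like branch: regresses a 6-DoF pose from a feature vector
        # encoding the optical flow between one image pair.
        self.f2fpe_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 6))

    def forward(self, seq_feats, pair_flow_feat):
        # seq_feats: (seq_len, batch, feat_dim) features of a short window
        # pair_flow_feat: (batch, feat_dim) feature of one pairwise flow
        ctx = self.tape_encoder(seq_feats)          # temporal context
        pose_tape = self.tape_head(ctx[-1])         # pose for the last pair
        pose_f2f = self.f2fpe_head(pair_flow_feat)  # pairwise pose
        return pose_tape, pose_f2f

def consistency_loss(pose_tape, pose_f2f):
    # Simple L1 agreement term between the two estimators (assumed form).
    return torch.mean(torch.abs(pose_tape - pose_f2f))

# Usage with random features standing in for CNN/flow encodings:
model = TwoBranchVO()
seq = torch.randn(5, 4, 256)    # window of 5 frames, batch of 4
flow = torch.randn(4, 256)      # one pairwise flow feature per sample
p_tape, p_f2f = model(seq, flow)
loss = consistency_loss(p_tape, p_f2f)
```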
Acknowledgements
This study was funded by the National Natural Science Foundation of China (Grant numbers 61906173 and 61822701).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Li, X., Hou, Y., Wang, P. et al. Transformer guided geometry model for flow-based unsupervised visual odometry. Neural Comput & Applic 33, 8031–8042 (2021). https://doi.org/10.1007/s00521-020-05545-8