PDANet: Self-Supervised Monocular Depth Estimation Using Perceptual and Data Augmentation Consistency
Figure 1: Illustration of the color constancy hypothesis problem in previous self-supervised depth estimation methods, where $I^l$ and $I^r$ denote the left and right views, $d^r$ denotes the disparity map of the right view, $ij$ denotes the coordinates of a pixel in an image, and RGB denotes the RGB value at each pixel.
Figure 2: Examples of PDANet’s results on KITTI 2015 [6]. From left to right: input images, ground-truth disparity maps, and results of PDANet.
Figure 3: Illustration of the overall architecture of PDANet, which consists of three components: data augmentation consistency, photometric consistency, and perceptual consistency.
Figure 4: Visualization of a feature fusion sample result and a brief illustration of feature aggregation. At each level, every layer of the original feature pyramid is resized to a specific size and number of channels and fused with specific weights.
Figure 5: Comparison between PDANet and Monodepth on KITTI 2015 [6]. PDANet performs much better than Monodepth on edges, landmarks, and low-texture areas.
Figure 6: Results on the Cityscapes dataset [37]. The strongly reflective Mercedes-Benz logo is removed, and only the top 80% of each output image is retained.
Figure 7: Results and comparison on the Make3D dataset [38].
Figure 8: Visualization of the ablation study. From left to right: input images, the result of Monodepth, the result of Monodepth with data augmentation consistency, the result of Monodepth with perceptual consistency, and the result of PDANet, where $\mathcal{L}_{aug}$ is the data augmentation consistency loss and $\mathcal{L}_{per}$ is the perceptual consistency loss. Blue patches show in detail the different effects of applying each module separately.
Abstract
1. Introduction
- Building on a baseline monocular depth estimation network, we integrate data augmentation consistency and perceptual consistency as supervisory signals, overcoming the color constancy hypothesis and the vanishing image gradients in low-texture regions that limit previous work.
- We propose a new unsupervised loss based on perceptual consistency that mines the deep semantic similarity between an image and its reconstruction, so that the model still captures genuine image differences in low-texture regions and regions with color fluctuations.
- We also propose a heavy data augmentation consistency scheme that addresses the color fluctuation problem and greatly enhances the generalization ability of the model (a minimal sketch of both consistency terms follows this list).
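To make the two consistency terms above concrete, here is a minimal PyTorch-style sketch. It is not the authors’ implementation: the `depth_net` and `augment` callables, the VGG16 layer cut, the stop-gradient target, and the plain L1 comparisons are all assumptions for illustration.

```python
# Hedged sketch of the two consistency losses (assumed interfaces, not the
# paper's code): `depth_net` maps a B x 3 x H x W image batch to disparities,
# `augment` applies heavy photometric augmentations as in Section 3.3.
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG16 trunk as the perceptual feature extractor; the exact layer
# cut (here, up to relu3_3) and the omitted input normalization are assumptions.
_vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad = False

def perceptual_consistency_loss(img, img_recon):
    """Compare deep features of the original and reconstructed views, so the
    loss stays informative where raw pixel gradients vanish (low texture)."""
    return F.l1_loss(_vgg(img_recon), _vgg(img))

def augmentation_consistency_loss(depth_net, img, augment):
    """Disparities predicted for an augmented image should agree with the
    prediction on the clean image (treated as a fixed target here)."""
    with torch.no_grad():
        d_clean = depth_net(img)          # prediction on the clean image
    d_aug = depth_net(augment(img))       # prediction on the augmented image
    return F.l1_loss(d_aug, d_clean)
```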
2. Related Work
2.1. Supervised Depth Estimation
2.2. Self-Supervised Depth Estimation
3. Method
3.1. Backbone of Depth Estimation
3.1.1. Reconstruction Loss
3.1.2. Disparity Smoothness Loss
3.1.3. Left–Right (LR) Consistency Loss
3.2. Perceptual Consistency
3.2.1. Feature Extraction
3.2.2. Feature Aggregation
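A minimal sketch of the aggregation described in the Figure 4 caption: each pyramid level is projected to a common channel count, resized to a target resolution, and fused with per-level weights. The class and parameter names, the 1×1 projections, the softmax weighting, and the bilinear resizing are illustrative assumptions, not the paper’s exact design.

```python
# Hedged feature-aggregation sketch: resize every pyramid level to one
# size/channel count, then fuse with learned per-level scalar weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregation(nn.Module):
    def __init__(self, in_channels, out_channels, out_size):
        super().__init__()
        self.out_size = out_size
        # 1x1 convolutions bring every level to the same channel count.
        self.proj = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # One fusion weight per pyramid level, normalized with softmax.
        self.weights = nn.Parameter(torch.ones(len(in_channels)))

    def forward(self, pyramid):
        w = torch.softmax(self.weights, dim=0)
        resized = [
            F.interpolate(p(f), size=self.out_size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, pyramid)
        ]
        return sum(wi * fi for wi, fi in zip(w, resized))
```

For example, `FeatureAggregation([64, 128, 256], 128, (64, 208))` would fuse a hypothetical three-level pyramid into a single 128-channel map.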
3.2.3. Perceptual Loss
3.3. Data Augmentation Consistency
3.3.1. Random Masking
3.3.2. Gamma Correction
3.3.3. Color Jitter and Blur
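As an illustration of the augmentation family named in Sections 3.3.1–3.3.3 (random masking, gamma correction, color jitter, and blur), the following is a hedged torchvision-based sketch; all magnitudes are placeholders, not the paper’s settings.

```python
# Illustrative heavy-augmentation pipeline; parameter ranges are assumptions.
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def heavy_augment(img):
    """img: 3 x H x W tensor in [0, 1]."""
    # Gamma correction with a randomly drawn exponent.
    img = TF.adjust_gamma(img, gamma=float(torch.empty(1).uniform_(0.8, 1.2)))
    # Color jitter over brightness, contrast, saturation, and hue.
    img = T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05)(img)
    # Gaussian blur with a random sigma.
    img = T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0))(img)
    # Random masking of a rectangular patch.
    img = T.RandomErasing(p=0.5, scale=(0.02, 0.1))(img)
    return img
```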
3.4. Overall Architecture and Loss
4. Results
4.1. Implementation Details
4.1.1. Post Processing
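The post-processing marked “pp”/“+ pp” in the result tables follows Monodepth: predict disparity for the input and for its horizontally flipped copy, flip the second prediction back, and average. Godard et al. additionally blend the two with ramps near the image borders, which this hedged sketch omits; the `depth_net` interface is an assumption.

```python
# Simplified Monodepth-style post-processing (edge-ramp blending omitted).
import torch

def post_process(depth_net, img):
    d = depth_net(img)                                         # B x 1 x H x W
    img_flipped = torch.flip(img, dims=[-1])                   # mirror input
    d_flipped = torch.flip(depth_net(img_flipped), dims=[-1])  # un-mirror output
    return 0.5 * (d + d_flipped)
```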
4.1.2. Low-Texture Regions
4.2. KITTI
4.3. Other Datasets
4.3.1. Cityscapes
4.3.2. Make3D
4.4. Ablation Study
4.4.1. Effect of Perceptual Consistency
4.4.2. Effect of Data Augmentation Consistency
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| SfM | Structure from Motion |
| SLAM | Simultaneous Localization and Mapping |
| VGG16 | Visual Geometry Group network with 16 weight layers |
| LiDAR | Light Detection and Ranging |
| RGB | Red, Green, and Blue |
| CNN | Convolutional Neural Network |
| Monodepth | Monocular depth estimation network proposed by Godard et al. |
| SSIM | Structural Similarity |
| L1 | Least Absolute Deviation |
| PP | Post Processing |
| RMSE | Root Mean Square Error |
| abs_rel | Absolute Relative Error |
| sq_rel | Squared Relative Error |
| GT | Ground Truth |
References
- Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2320–2327.
- Kak, A.; DeSouza, G. Vision for mobile robot navigation. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 237–267.
- Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3061–3070.
- Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. arXiv 2014, arXiv:1406.2283.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012.
- Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658.
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248.
- Cao, Y.; Wu, Z.; Shen, C. Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 3174–3182.
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2002–2011.
- Zhan, H.; Garg, R.; Weerasekera, C.S.; Li, K.; Agarwal, H.; Reid, I. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 340–349.
- Chen, W.; Fu, Z.; Yang, D.; Deng, J. Single-image depth perception in the wild. arXiv 2016, arXiv:1604.03901.
- Wu, Y.; Ying, S.; Zheng, L. Size-to-depth: A new perspective for single image depth estimation. arXiv 2018, arXiv:1801.04461.
- Garg, R.; Bg, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 740–756.
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279.
- Tosi, F.; Aleotti, F.; Poggi, M.; Mattoccia, S. Learning monocular depth estimation infusing traditional stereo knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9799–9809.
- Wong, A.; Soatto, S. Bilateral cyclic constraint and adaptive regularization for unsupervised monocular depth prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5644–5653.
- Poggi, M.; Tosi, F.; Mattoccia, S. Learning monocular depth estimation with unsupervised trinocular assumptions. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 324–333.
- Pilzer, A.; Xu, D.; Puscas, M.; Ricci, E.; Sebe, N. Unsupervised adversarial depth estimation using cycled generative networks. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 587–595.
- CS Kumar, A.; Bhandarkar, S.M.; Prasad, M. Monocular depth prediction using generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 300–308.
- Zhai, M.; Xiang, X.; Lv, N.; Kong, X.; El Saddik, A. An object context integrated network for joint learning of depth and optical flow. IEEE Trans. Image Process. 2020, 29, 7807–7818.
- Shu, C.; Yu, K.; Duan, Z.; Yang, K. Feature-metric loss for self-supervised learning of depth and egomotion. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 572–588.
- Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 694–711.
- Yang, G.; Zhao, H.; Shi, J.; Deng, Z.; Jia, J. SegStereo: Exploiting semantic information for disparity estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 636–651.
- Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516.
- Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.T.; Le, Q.V. Unsupervised data augmentation for consistency training. arXiv 2019, arXiv:1904.12848.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Cambridge, MA, USA, 13–18 July 2020; pp. 1597–1607.
- Dovesi, P.L.; Poggi, M.; Andraghetti, L.; Martí, M.; Kjellström, H.; Pieropan, A.; Mattoccia, S. Real-time semantic stereo matching. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10780–10787.
- Xu, D.; Wang, W.; Tang, H.; Liu, H.; Sebe, N.; Ricci, E. Structured attention guided convolutional neural fields for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3917–3925.
- Zhang, H.; Shen, C.; Li, Y.; Cao, Y.; Liu, Y.; Yan, Y. Exploiting temporal consistency for real-time video depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1725–1734.
- Amiri, A.J.; Loo, S.Y.; Zhang, H. Semi-supervised monocular depth estimation with left-right consistency using deep neural network. In Proceedings of the 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), Dali, China, 6–8 December 2019; pp. 602–607.
- Ranjan, A.; Jampani, V.; Balles, L.; Kim, K.; Sun, D.; Wulff, J.; Black, M.J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12240–12249.
- Chen, P.Y.; Liu, A.H.; Liu, Y.C.; Wang, Y.C.F. Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2624–2632.
- Zhou, J.; Wang, Y.; Qin, K.; Zeng, W. Unsupervised high-resolution depth learning from videos with dual networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6872–6881.
- Casser, V.; Pirk, S.; Mahjourian, R.; Angelova, A. Unsupervised monocular depth and ego-motion learning with structure and semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019.
- Chen, Y.; Schmid, C.; Sminchisescu, C. Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7063–7072.
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
- Saxena, A.; Sun, M.; Ng, A.Y. Make3D: Learning 3D scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 824–840.
- Karsch, K.; Liu, C.; Kang, S.B. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2144–2158.
| | Supervised | Unsupervised | PDANet |
|---|---|---|---|
| Ground truth | Needed | Not needed | Not needed |
| Acquisition of depth maps | Hard | – | – |
| Generalization ability | Low | – | – |
| Color constancy hypothesis problem | No | Yes | No |
| Low-texture region problem | No | Yes | No |
| Quality of prediction | High | Low | Relatively high |
| Method | Strategy | abs_rel ↓ | sq_rel ↓ | RMSE ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|
| Eigen [6] | supervised | 0.190 | 1.515 | 7.156 | 0.692 | 0.899 | 0.967 |
| Cao [28] | supervised | 0.180 | – | 6.311 | 0.771 | 0.917 | 0.966 |
| Xu [29] | supervised | 0.122 | 0.897 | 4.677 | 0.818 | 0.954 | 0.985 |
| Zhang [30] | supervised | 0.101 | – | 4.137 | 0.890 | 0.970 | 0.989 |
| Amiri [31] | supervised | 0.096 | 0.552 | 3.995 | 0.892 | 0.972 | 0.992 |
| Garg [12] | unsupervised | 0.169 | – | 5.104 | 0.740 | 0.904 | 0.958 |
| Godard [13] | unsupervised | 0.114 | – | 4.935 | 0.861 | 0.949 | 0.976 |
| Zhan [11] | unsupervised | 0.151 | 1.257 | 5.583 | 0.810 | 0.936 | 0.974 |
| Ranjan [32] | unsupervised | 0.140 | 1.070 | 5.326 | 0.826 | 0.941 | 0.975 |
| Poggi [18] | unsupervised | 0.126 | 0.961 | 5.205 | 0.835 | 0.941 | 0.974 |
| Wong [14] | unsupervised | 0.133 | 1.126 | 5.515 | 0.826 | 0.934 | 0.969 |
| Chen [33] | unsupervised | 0.118 | 0.905 | 5.096 | 0.839 | 0.945 | 0.977 |
| Zhou [34] | unsupervised | 0.121 | 0.837 | 4.945 | 0.853 | 0.955 | 0.982 |
| Casser [35] | unsupervised | 0.108 | 0.825 | 4.750 | 0.873 | 0.957 | 0.982 |
| Chen [36] | unsupervised | 0.099 | 0.796 | 4.743 | 0.884 | 0.955 | 0.979 |
| Tosi [16] | unsupervised | 0.111 | 0.867 | 4.714 | 0.864 | 0.954 | 0.979 |
| PDANet | unsupervised | 0.084 | 0.961 | 4.701 | 0.916 | 0.970 | 0.988 |
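The metric columns are the standard depth-evaluation measures popularized by Eigen et al. For reference, a small NumPy sketch of their usual definitions, assuming `gt` and `pred` are same-shape arrays of valid ground-truth and predicted depths:

```python
# Standard monocular-depth evaluation metrics (usual definitions).
import numpy as np

def depth_metrics(gt, pred):
    thresh = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),   # absolute relative error
        "sq_rel":  np.mean((gt - pred) ** 2 / gt),    # squared relative error
        "rmse":    np.sqrt(np.mean((gt - pred) ** 2)),
        "d1":      np.mean(thresh < 1.25),            # δ < 1.25
        "d2":      np.mean(thresh < 1.25 ** 2),       # δ < 1.25²
        "d3":      np.mean(thresh < 1.25 ** 3),       # δ < 1.25³
    }
```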
| Method | Strategy | sq_rel ↓ | abs_rel ↓ | RMSE ↓ | log10 ↓ |
|---|---|---|---|---|---|
| Train set mean | supervised | 15.517 | 0.893 | 11.542 | 0.223 |
| Karsch [39] | supervised | 4.894 | 0.417 | 8.172 | 0.144 |
| Laina (berHu) [8] | supervised | 1.665 | 0.198 | 5.461 | 0.082 |
| Monodepth | unsupervised | 9.515 | 0.521 | 10.733 | 0.238 |
| Monodepth + pp | unsupervised | 5.808 | 0.437 | 8.966 | 0.216 |
| Ours | unsupervised | 8.497 | 0.491 | 10.525 | 0.233 |
| Ours + pp | unsupervised | 5.054 | 0.418 | 8.671 | 0.212 |
| $\mathcal{L}_{per}$ | $\mathcal{L}_{aug}$ | pp | abs_rel ↓ | sq_rel ↓ | RMSE ↓ | D1-all ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|---|
| | | | 0.1016 | 1.7372 | 5.581 | 17.200 | 0.905 | 0.964 | 0.983 |
| ✓ | | | 0.0920 | 1.2444 | 5.127 | 17.382 | 0.909 | 0.969 | 0.987 |
| | ✓ | | 0.1013 | 1.7183 | 5.539 | 17.154 | 0.905 | 0.964 | 0.983 |
| ✓ | ✓ | | 0.0921 | 1.2345 | 4.996 | 17.890 | 0.912 | 0.970 | 0.987 |
| | | ✓ | 0.0948 | 1.3093 | 5.264 | 17.089 | 0.908 | 0.965 | 0.985 |
| ✓ | | ✓ | 0.0855 | 1.0402 | 4.797 | 16.514 | 0.915 | 0.971 | 0.988 |
| | ✓ | ✓ | 0.0916 | 1.1597 | 5.150 | 16.909 | 0.910 | 0.966 | 0.985 |
| ✓ | ✓ | ✓ | 0.0846 | 0.9613 | 4.701 | 16.683 | 0.916 | 0.970 | 0.988 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).