Abstract
As autonomous vehicles move closer to everyday use, architectures built as redundant pipelines are becoming increasingly critical. To provide this redundancy without inflating costs, researchers aim to avoid duplicating high-cost sensors such as LiDARs. In this work, we propose using monocular cameras, which several modules of the autonomous platform already require, for 3D scene understanding. Although many single-image depth estimation methods have been proposed in the literature, they typically rely on complex neural network ensembles that extract dense feature maps, incurring a high computational cost. Instead, we propose a novel and inherently efficient method for obtaining depth images that replaces these tangled neural architectures with attention mechanisms applied to basic encoder–decoder models. We evaluate our method on the public KITTI dataset and in real-world experiments on our automated vehicle. The results demonstrate the viability of our approach, which competes with intricate state-of-the-art methods while outperforming most alternatives based on attention mechanisms.
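The core idea of the abstract, reweighting an encoder–decoder skip connection with an attention signal derived from coarser features, can be illustrated with a deliberately tiny NumPy sketch. This is our own toy stand-in, not the DAttNet architecture: pooling stages replace learned convolutions, and the additive attention gate uses fixed scalar weights (`w_s`, `w_g`) purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def avg_pool2(x):
    # 2x2 average pooling: downsample a single-channel map by 2
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    # nearest-neighbour upsampling by 2
    return x.repeat(2, axis=0).repeat(2, axis=1)

def attention_gate(skip, gating, w_s=1.0, w_g=1.0):
    # Additive attention: score the skip features against the coarser
    # gating signal, then reweight the skip connection element-wise.
    alpha = sigmoid(w_s * skip + w_g * upsample2(gating))
    return alpha * skip

def toy_depth_forward(image):
    # "Encoder": two pooling stages stand in for convolutional blocks
    e1 = avg_pool2(image)        # H/2 x W/2 features
    e2 = avg_pool2(e1)           # H/4 x W/4 bottleneck
    # "Decoder": attention-gated skip connection, then upsample
    d1 = attention_gate(e1, e2)  # reweighted skip at H/2 resolution
    depth = upsample2(d1)        # back to H x W, one value per pixel
    return depth

img = np.random.rand(8, 8)   # stand-in for a grayscale camera image
depth = toy_depth_forward(img)
print(depth.shape)           # (8, 8): dense per-pixel prediction
```

The point of the gate is that the attention map `alpha` is cheap to compute relative to the dense multi-branch feature extractors the abstract contrasts against: the decoder reuses encoder features instead of recomputing them, and attention decides how much each skip activation contributes.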
Data availability
The authors declare that the dataset used for training and validating the results presented in this study is openly accessible and available at: https://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction [26].
Notes
Additional results: https://www.youtube.com/watch?v=pQDc_AimYiU.
References
Beltrán J, Guindel C, Cortés I, Barrera A, Astudillo A, Urdiales J, Álvarez M, Bekka F, Milanés V, García F (2020) Towards autonomous driving: a multi-modal 360° perception proposal. In: 2020 IEEE 23rd international conference on intelligent transportation systems (ITSC), pp. 3295–3300. https://doi.org/10.1109/ITSC45102.2020.9294494
Liang M, Yang B, Zeng W, Chen Y, Hu R, Casas S, Urtasun R (2020) PnPNet: End-to-end perception and prediction with tracking in the loop. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 11550–11559. https://doi.org/10.1109/CVPR42600.2020.01157
Astudillo A, Molina N, Cortés I, Mahtout I, González D, Beltrán J, Guindel C, Barrera A, Álvarez M, Zinoune C, Milanés V, García F (2021) Visibility-aware adaptative speed planner for human-like navigation in roundabouts. In: 2021 IEEE International intelligent transportation systems conference (ITSC), pp. 885–890. https://doi.org/10.1109/ITSC48978.2021.9564451
Pei L, Rui Z (2015) The analysis of stereo vision 3D point cloud data of autonomous vehicle obstacle recognition. In: 2015 7th International conference on intelligent human-machine systems and cybernetics, vol. 2, pp. 207–210. https://doi.org/10.1109/IHMSC.2015.192
Doval GN, Al-Kaff A, Beltrán J, Fernández FG, Fernández López G (2019) Traffic sign detection and 3D localization via deep convolutional neural networks and stereo vision. In: 2019 IEEE intelligent transportation systems conference (ITSC), pp. 1411–1416. https://doi.org/10.1109/ITSC.2019.8916958
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2020) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 12472–12482. https://doi.org/10.1109/CVPR42600.2020.01249
Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision - ECCV 2018. Springer, Cham, pp 833–851
Miguel MA, Moreno FM, Marín-Plaza P, Al-Kaff A, Palos M, Martín Gómez D, Encinar-Martín R, Garcia F (2020) A research platform for autonomous vehicles technologies research in the insurance sector. Appl Sci 10:5655. https://doi.org/10.3390/app10165655
Scharstein D, Szeliski R, Zabih R (2001) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In: Proceedings IEEE workshop on stereo and multi-baseline vision (SMBV 2001), pp. 131–140. https://doi.org/10.1109/SMBV.2001.988771
Khamis S, Fanello S, Rhemann C, Kowdle A, Valentin J, Izadi S (2018) StereoNet: guided hierarchical refinement for real-time edge-aware depth prediction. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision - ECCV 2018. Springer, Cham, pp 596–613
Xu H, Zhang J (2020) AANet: Adaptive aggregation network for efficient stereo matching. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 1956–1965. https://doi.org/10.1109/CVPR42600.2020.00203
Godard C, Aodha OM, Firman M, Brostow G (2019) Digging into self-supervised monocular depth estimation. In: 2019 IEEE/CVF International conference on computer vision (ICCV), pp. 3827–3837. https://doi.org/10.1109/ICCV.2019.00393
Lee JH, Han MK, Ko DW, Suh IH (2021) From big to small: multi-scale local planar guidance for monocular depth estimation. arXiv. arXiv:1907.10326 [cs]. https://doi.org/10.48550/arXiv.1907.10326
Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Patt Anal Mach Intell 40(4):834–848. https://doi.org/10.1109/TPAMI.2017.2699184
Galassi A, Lippi M, Torroni P (2021) Attention in natural language processing. IEEE Trans Neural Netw Learn Syst 32(10):4291–4308. https://doi.org/10.1109/TNNLS.2020.3019893
Lu Y, Hao X, Li Y, Chai W, Sun S, Velipasalar S (2022) Range-aware attention network for lidar-based 3d object detection with auxiliary point density level estimation. arXiv. arXiv:2111.09515 [cs]. https://doi.org/10.48550/arXiv.2111.09515
Chen Y, Zhao H, Hu Z, Peng J (2021) Attention-based context aggregation network for monocular depth estimation. Int J Mach Learn Cybern 12:1583–1596. https://doi.org/10.1007/s13042-020-01251-y
Fu H, Gong M, Wang C, Batmanghelich K, Tao D (2018) Deep ordinal regression network for monocular depth estimation. In: 2018 IEEE/CVF Conference on computer vision and pattern recognition, pp. 2002–2011. https://doi.org/10.1109/CVPR.2018.00214
Song X, Li W, Zhou D, Dai Y, Fang J, Li H, Zhang L (2021) MLDA-Net: multi-level dual attention-based network for self-supervised monocular depth estimation. IEEE Trans Image Process 30:4691–4705. https://doi.org/10.1109/TIP.2021.3074306
Nah S, Kim TH, Lee KM (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp. 257–265. https://doi.org/10.1109/CVPR.2017.35
Wang Y, Ying X., Wang L, Yang J, An W, Guo Y (2021) Symmetric parallax attention for stereo image super-resolution. In: 2021 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW), pp. 766–775. https://doi.org/10.1109/CVPRW53098.2021.00086
Zhang H, Goodfellow I, Metaxas D, Odena A (2019) Self-attention generative adversarial networks. In: International conference on machine learning, PMLR, pp. 7354–7363
Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems 27
Chiang TH, Chiang MH, Tsai MH, Chang CC (2022) Attention-based background/foreground monocular depth prediction model using image segmentation. Appl Sci 12(21):11186. https://doi.org/10.3390/app122111186
Yan J, Zhao H, Bu P, Jin Y (2021) Channel-wise attention-based network for self-supervised monocular depth estimation. In: 2021 International conference on 3d vision (3DV), pp. 464–473. https://doi.org/10.1109/3DV53792.2021.00056
Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In: 2012 IEEE Conference on computer vision and pattern recognition, pp. 3354–3361. https://doi.org/10.1109/CVPR.2012.6248074
Xiao P, Shao Z, Hao S, Zhang Z, Chai X, Jiao J, Li Z, Wu J, Sun K, Jiang K, Wang Y, Yang D (2021) Pandaset: Advanced sensor suite dataset for autonomous driving. In: 2021 IEEE International intelligent transportation systems conference (ITSC), pp. 3095–3101. https://doi.org/10.1109/ITSC48978.2021.9565009
Hormann K (2014) Barycentric interpolation. In: Fasshauer GE, Schumaker LL (eds) Approximation Theory XIV: San Antonio 2013. Springer, Cham, pp 197–218
Acknowledgements
This work has been supported by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M (“Fostering Young Doctors Research”, APBI-CM-UC3M), and in the context of the V PRICIT (Research and Technological Innovation Regional Programme). Carlos Guindel acknowledges the support of the Ministry of Universities and the Universidad Carlos III de Madrid’s Call for Grants for the requalification of the Spanish University System for 2021-2023, based on Royal Decree 289/2021 of April 20, 2021, which regulates the direct granting of subsidies to public universities for the requalification of the Spanish university system. This work has been supported by the Spanish Government through the projects ID2021-128327OA-I00, PID2021-124335OB-C21 and TED2021-129374A-I00 funded by MCIN/AEI/10.13039/501100011033, by the European Union NextGenerationEU/PRTR.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Astudillo, A., Barrera, A., Guindel, C. et al. DAttNet: monocular depth estimation network based on attention mechanisms. Neural Comput & Applic 36, 3347–3356 (2024). https://doi.org/10.1007/s00521-023-09210-8