Abstract
As autonomous vehicles move closer to everyday use, architectures built as redundant pipelines are becoming increasingly critical. To provide this redundancy without inflating costs, researchers aim to avoid duplicating high-cost sensors such as LiDARs. In this work, we propose using monocular cameras, which several modules of the autonomous platform already require, for 3D scene understanding. Although many single-image depth estimation methods have been proposed in the literature, they typically rely on complex neural network ensembles that extract dense feature maps, incurring a high computational cost. Instead, we propose a novel and inherently efficient method for obtaining depth images that replaces these tangled neural architectures with attention mechanisms applied to basic encoder–decoder models. We evaluate our method on the public KITTI dataset and in real-world experiments on our automated vehicle. The results demonstrate the viability of our approach, which competes with intricate state-of-the-art methods while outperforming most alternatives based on attention mechanisms.
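The core idea of the abstract, reweighting an encoder–decoder skip connection with an attention signal derived from coarser features, can be illustrated with a deliberately tiny NumPy sketch. This is our own toy stand-in, not the DAttNet architecture: pooling stages replace learned convolutions, and the additive attention gate uses fixed scalar weights (`w_s`, `w_g`) purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def avg_pool2(x):
    # 2x2 average pooling: downsample a single-channel map by 2
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    # nearest-neighbour upsampling by 2
    return x.repeat(2, axis=0).repeat(2, axis=1)

def attention_gate(skip, gating, w_s=1.0, w_g=1.0):
    # Additive attention: score the skip features against the coarser
    # gating signal, then reweight the skip connection element-wise.
    alpha = sigmoid(w_s * skip + w_g * upsample2(gating))
    return alpha * skip

def toy_depth_forward(image):
    # "Encoder": two pooling stages stand in for convolutional blocks
    e1 = avg_pool2(image)        # H/2 x W/2 features
    e2 = avg_pool2(e1)           # H/4 x W/4 bottleneck
    # "Decoder": attention-gated skip connection, then upsample
    d1 = attention_gate(e1, e2)  # reweighted skip at H/2 resolution
    depth = upsample2(d1)        # back to H x W, one value per pixel
    return depth

img = np.random.rand(8, 8)   # stand-in for a grayscale camera image
depth = toy_depth_forward(img)
print(depth.shape)           # (8, 8): dense per-pixel prediction
```

The point of the gate is that the attention map `alpha` is cheap to compute relative to the dense multi-branch feature extractors the abstract contrasts against: the decoder reuses encoder features instead of recomputing them, and attention decides how much each skip activation contributes.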
Data availability
The authors declare that the dataset used for training and validating the results presented in this study is openly accessible and available at: https://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction [26].
Notes
Additional results: https://www.youtube.com/watch?v=pQDc_AimYiU.
References
Beltrán J, Guindel C, Cortés I, Barrera A, Astudillo A, Urdiales J, Álvarez M, Bekka F, Milanés V, García F (2020) Towards autonomous driving: a multi-modal 360° perception proposal. In: 2020 IEEE 23rd international conference on intelligent transportation systems (ITSC), pp. 3295–3300. https://doi.org/10.1109/ITSC45102.2020.9294494
Liang M, Yang B, Zeng W, Chen Y, Hu R, Casas S, Urtasun R (2020) PnPNet: End-to-end perception and prediction with tracking in the loop. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 11550–11559. https://doi.org/10.1109/CVPR42600.2020.01157
Astudillo A, Molina N, Cortés I, Mahtout I, González D, Beltrán J, Guindel C, Barrera A, Álvarez M, Zinoune C, Milanés V, García F (2021) Visibility-aware adaptative speed planner for human-like navigation in roundabouts. In: 2021 IEEE International intelligent transportation systems conference (ITSC), pp. 885–890. https://doi.org/10.1109/ITSC48978.2021.9564451
Pei L, Rui Z (2015) The analysis of stereo vision 3D point cloud data of autonomous vehicle obstacle recognition. In: 2015 7th International conference on intelligent human-machine systems and cybernetics, vol. 2, pp. 207–210. https://doi.org/10.1109/IHMSC.2015.192
Doval GN, Al-Kaff A, Beltrán J, Fernández FG, Fernández López G (2019) Traffic sign detection and 3D localization via deep convolutional neural networks and stereo vision. In: 2019 IEEE intelligent transportation systems conference (ITSC), pp. 1411–1416. https://doi.org/10.1109/ITSC.2019.8916958
Cheng B, Collins MD, Zhu Y, Liu T, Huang TS, Adam H, Chen LC (2020) Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 12472–12482. https://doi.org/10.1109/CVPR42600.2020.01249
Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision - ECCV 2018. Springer, Cham, pp 833–851
Miguel MA, Moreno FM, Marín-Plaza P, Al-Kaff A, Palos M, Martín Gómez D, Encinar-Martín R, Garcia F (2020) A research platform for autonomous vehicles technologies research in the insurance sector. Appl Sci 10:5655. https://doi.org/10.3390/app10165655
Scharstein D, Szeliski R, Zabih R (2001) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In: Proceedings IEEE workshop on stereo and multi-baseline vision (SMBV 2001), pp. 131–140. https://doi.org/10.1109/SMBV.2001.988771
Khamis S, Fanello S, Rhemann C, Kowdle A, Valentin J, Izadi S (2018) StereoNet: guided hierarchical refinement for real-time edge-aware depth prediction. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision - ECCV 2018. Springer, Cham, pp 596–613
Xu H, Zhang J (2020) AANet: Adaptive aggregation network for efficient stereo matching. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 1956–1965. https://doi.org/10.1109/CVPR42600.2020.00203
Godard C, Aodha OM, Firman M, Brostow G (2019) Digging into self-supervised monocular depth estimation. In: 2019 IEEE/CVF International conference on computer vision (ICCV), pp. 3827–3837. https://doi.org/10.1109/ICCV.2019.00393
Lee JH, Han MK, Ko DW, Suh IH (2021) From big to small: multi-scale local planar guidance for monocular depth estimation. arXiv. arXiv:1907.10326 [cs]. https://doi.org/10.48550/arXiv.1907.10326
Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Patt Anal Mach Intell 40(4):834–848. https://doi.org/10.1109/TPAMI.2017.2699184
Galassi A, Lippi M, Torroni P (2021) Attention in natural language processing. IEEE Trans Neural Netw Learn Syst 32(10):4291–4308. https://doi.org/10.1109/TNNLS.2020.3019893
Lu Y, Hao X, Li Y, Chai W, Sun S, Velipasalar S (2022) Range-aware attention network for lidar-based 3d object detection with auxiliary point density level estimation. arXiv. arXiv:2111.09515 [cs]. https://doi.org/10.48550/arXiv.2111.09515
Chen Y, Zhao H, Hu Z, Peng J (2021) Attention-based context aggregation network for monocular depth estimation. Int J Mach Learn Cybern 12:1583–1596. https://doi.org/10.1007/s13042-020-01251-y
Fu H, Gong M, Wang C, Batmanghelich K, Tao D (2018) Deep ordinal regression network for monocular depth estimation. In: 2018 IEEE/CVF Conference on computer vision and pattern recognition, pp. 2002–2011. https://doi.org/10.1109/CVPR.2018.00214
Song X, Li W, Zhou D, Dai Y, Fang J, Li H, Zhang L (2021) MLDA-Net: multi-level dual attention-based network for self-supervised monocular depth estimation. IEEE Trans Image Process 30:4691–4705. https://doi.org/10.1109/TIP.2021.3074306
Nah S, Kim TH, Lee KM (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp. 257–265. https://doi.org/10.1109/CVPR.2017.35
Wang Y, Ying X., Wang L, Yang J, An W, Guo Y (2021) Symmetric parallax attention for stereo image super-resolution. In: 2021 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW), pp. 766–775. https://doi.org/10.1109/CVPRW53098.2021.00086
Zhang H, Goodfellow I, Metaxas D, Odena A (2019) Self-attention generative adversarial networks. In: International conference on machine learning, PMLR, pp. 7354–7363
Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems 27
Chiang TH, Chiang MH, Tsai MH, Chang CC (2022) Attention-based background/foreground monocular depth prediction model using image segmentation. Appl Sci 12(21):11186. https://doi.org/10.3390/app122111186
Yan J, Zhao H, Bu P, Jin Y (2021) Channel-wise attention-based network for self-supervised monocular depth estimation. In: 2021 International conference on 3d vision (3DV), pp. 464–473. https://doi.org/10.1109/3DV53792.2021.00056
Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In: 2012 IEEE Conference on computer vision and pattern recognition, pp. 3354–3361. https://doi.org/10.1109/CVPR.2012.6248074
Xiao P, Shao Z, Hao S, Zhang Z, Chai X, Jiao J, Li Z, Wu J, Sun K, Jiang K, Wang Y, Yang D (2021) Pandaset: Advanced sensor suite dataset for autonomous driving. In: 2021 IEEE International intelligent transportation systems conference (ITSC), pp. 3095–3101. https://doi.org/10.1109/ITSC48978.2021.9565009
Hormann K (2014) Barycentric interpolation. In: Fasshauer GE, Schumaker LL (eds) Approximation Theory XIV: San Antonio 2013. Springer, Cham, pp 197–218
Acknowledgements
This work has been supported by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M (“Fostering Young Doctors Research”, APBI-CM-UC3M), and in the context of the V PRICIT (Research and Technological Innovation Regional Programme). Carlos Guindel acknowledges the support of the Ministry of Universities and the Universidad Carlos III de Madrid’s Call for Grants for the requalification of the Spanish University System for 2021-2023, based on Royal Decree 289/2021 of April 20, 2021, which regulates the direct granting of subsidies to public universities for the requalification of the Spanish university system. This work has been supported by the Spanish Government through the projects ID2021-128327OA-I00, PID2021-124335OB-C21 and TED2021-129374A-I00 funded by MCIN/AEI/10.13039/501100011033, by the European Union NextGenerationEU/PRTR.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Astudillo, A., Barrera, A., Guindel, C. et al. DAttNet: monocular depth estimation network based on attention mechanisms. Neural Comput & Applic 36, 3347–3356 (2024). https://doi.org/10.1007/s00521-023-09210-8