RCFNet: Related cross-level feature network with cascaded self-distillation for monocular depth estimation

Published: 21 November 2024

Abstract

Monocular depth estimation (MDE) is a challenging yet crucial computer vision task that aims to generate accurate depth maps from a single image. Existing MDE approaches mainly rely on extracting and fusing diverse information from multi-level features to improve prediction accuracy. However, these methods often adopt a traditional feature pyramid structure and neglect a comprehensive exploration of feature fusion paths across multiple levels. Moreover, a single feature-fusion strategy has limited ability to optimize the network. We propose a novel related cross-level feature network (RCFNet) with cascaded self-distillation for monocular depth estimation, comprising a cross-level feature enhancement (CLFE) module, a hierarchical feature cross refinement (HFCR) module, and a cascaded self-distillation (CSD) module. The CLFE module integrates cross-level features to further exploit the highest-level features, deploying a channel attention mechanism with hybrid weight operations to enhance the initial features. The HFCR module adaptively captures strongly correlated complementary information through a window-based multi-head cross-attention mechanism to generate refined features. Meanwhile, the proposed CSD module with a hierarchical feature transformation loss acts as a virtual teacher that progressively extracts discriminative features within the network to improve gradient flow. Extensive experiments on the NYUv2 and KITTI datasets demonstrate that our method outperforms existing state-of-the-art MDE methods in accuracy and robustness.
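
As a concrete illustration of two of the mechanisms named above, the minimal PyTorch sketch below shows (i) a channel-attention block that fuses average- and max-pooled channel descriptors, one plausible reading of the CLFE module's "hybrid weight operations", and (ii) a cascaded self-distillation loss in which the deepest refined feature serves as a virtual teacher for shallower ones, an assumed form of the CSD module's hierarchical feature transformation loss. All class names, shapes, and weighting choices are assumptions for exposition, not the authors' implementation.

```python
# Hedged sketch (PyTorch): a hypothetical reconstruction of two ideas from
# the abstract, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridChannelAttention(nn.Module):
    """Channel attention with 'hybrid' weights: average- and max-pooled
    channel descriptors share one gating MLP (assumed design)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = F.adaptive_avg_pool2d(x, 1).view(b, c)
        mx = F.adaptive_max_pool2d(x, 1).view(b, c)
        # Hybrid weighting: sum the two descriptors' gating responses.
        w = torch.sigmoid(self.mlp(avg) + self.mlp(mx)).view(b, c, 1, 1)
        return x * w


def cascaded_self_distillation_loss(features, proj_heads):
    """Use the deepest refined feature as a 'virtual teacher' and pull each
    shallower feature toward it after a learned transformation (an assumed
    form of the hierarchical feature transformation loss)."""
    teacher = features[-1].detach()  # deepest feature; no teacher gradient
    loss = 0.0
    for feat, head in zip(features[:-1], proj_heads):
        student = head(feat)  # e.g. a 1x1 conv matching teacher channels
        student = F.interpolate(
            student, size=teacher.shape[-2:], mode="bilinear",
            align_corners=False,
        )
        loss = loss + F.mse_loss(student, teacher)
    return loss / max(len(features) - 1, 1)
```

In practice, proj_heads would be an nn.ModuleList of 1x1 convolutions mapping each feature's channels to the teacher's, registered on the model so they train jointly; the actual RCFNet loss may use a different transformation or distance.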



Published In

Digital Signal Processing, Volume 154, Issue C, November 2024, 623 pages

Publisher

Academic Press, Inc., United States

Author Tags

1. Attention mechanism
2. Cross-level features
3. Monocular depth estimation
4. Self-distillation

Qualifiers

• Research-article
