RCFNet: Related cross-level feature network with cascaded self-distillation for monocular depth estimation

Published: 21 November 2024

Abstract

Monocular depth estimation (MDE) is a challenging yet crucial computer vision task that aims to generate accurate depth maps from a single image. Existing MDE approaches mainly rely on extracting and fusing diverse information from multi-level features to improve prediction accuracy. However, these methods often adopt a traditional feature pyramid structure and neglect a comprehensive exploration of feature fusion paths across multiple levels. Moreover, a single feature-fusion strategy has limited ability to optimize the network. We propose a novel related cross-level feature network (RCFNet) with cascaded self-distillation for monocular depth estimation, comprising a cross-level feature enhancement (CLFE) module, a hierarchical feature cross refinement (HFCR) module, and a cascaded self-distillation (CSD) module. The CLFE module integrates cross-level features to further exploit the highest-level features, deploying a channel attention mechanism with hybrid weight operations to enhance the initial features. The HFCR module adaptively captures strongly correlated complementary information through a window-based multi-head cross-attention mechanism to generate refined features. Meanwhile, the proposed CSD module with a hierarchical feature transformation loss acts as a virtual teacher that progressively extracts discriminative features within the network to improve gradient flow. Extensive experiments on the NYUv2 and KITTI datasets demonstrate that our method outperforms existing state-of-the-art MDE methods in accuracy and robustness.
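
As a concrete illustration of two of the mechanisms named above, the minimal PyTorch sketch below shows (i) a channel-attention block that fuses average- and max-pooled channel descriptors, one plausible reading of the CLFE module's "hybrid weight operations", and (ii) a cascaded self-distillation loss in which the deepest refined feature serves as a virtual teacher for shallower ones, an assumed form of the CSD module's hierarchical feature transformation loss. All class names, shapes, and weighting choices are assumptions for exposition, not the authors' implementation.

```python
# Hedged sketch (PyTorch): a hypothetical reconstruction of two ideas from
# the abstract, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridChannelAttention(nn.Module):
    """Channel attention with 'hybrid' weights: average- and max-pooled
    channel descriptors share one gating MLP (assumed design)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = F.adaptive_avg_pool2d(x, 1).view(b, c)
        mx = F.adaptive_max_pool2d(x, 1).view(b, c)
        # Hybrid weighting: sum the two descriptors' gating responses.
        w = torch.sigmoid(self.mlp(avg) + self.mlp(mx)).view(b, c, 1, 1)
        return x * w


def cascaded_self_distillation_loss(features, proj_heads):
    """Use the deepest refined feature as a 'virtual teacher' and pull each
    shallower feature toward it after a learned transformation (an assumed
    form of the hierarchical feature transformation loss)."""
    teacher = features[-1].detach()  # deepest feature; no teacher gradient
    loss = 0.0
    for feat, head in zip(features[:-1], proj_heads):
        student = head(feat)  # e.g. a 1x1 conv matching teacher channels
        student = F.interpolate(
            student, size=teacher.shape[-2:], mode="bilinear",
            align_corners=False,
        )
        loss = loss + F.mse_loss(student, teacher)
    return loss / max(len(features) - 1, 1)
```

In practice, proj_heads would be an nn.ModuleList of 1x1 convolutions mapping each feature's channels to the teacher's, registered on the model so they train jointly; the actual RCFNet loss may use a different transformation or distance.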



Published In

Digital Signal Processing, Volume 154, Issue C, November 2024, 623 pages

Publisher

Academic Press, Inc., United States

Author Tags

1. Attention mechanism
2. Cross-level features
3. Monocular depth estimation
4. Self-distillation

Qualifiers

• Research-article
