Distortion-Aware Self-Supervised Indoor 360° Depth Estimation via Hybrid Projection Fusion and Structural Regularities

Published: 01 January 2024 in IEEE Transactions on Multimedia, vol. 26, 2024 (IEEE Press). Research article.

Abstract

Owing to the rapid development of emerging 360° panoramic imaging techniques, indoor 360° depth estimation has attracted extensive attention in the community. Because ground-truth depth data are scarce, it is highly desirable to formulate indoor 360° depth estimation in a self-supervised manner. However, self-supervised 360° depth estimation suffers from two major limitations: the distortion and network-training problems caused by equirectangular projection (ERP), and the difficulty of back-propagating useful supervision through texture-less regions in the self-supervised setting. To address these issues, we introduce spherical view synthesis for learning self-supervised 360° depth estimation. Specifically, to alleviate the ERP-related problems, we first propose a dual-branch distortion-aware network, comprising a distortion-aware module and a hybrid projection fusion module, to produce a coarse depth map. The coarse depth map is then used for spherical view synthesis, in which a spherically weighted loss function for view reconstruction and depth smoothing is investigated to account for the non-uniform projection distribution of 360° images. In addition, two structural regularities of indoor 360° scenes, the principal-direction normal constraint and the co-planar depth constraint, are devised as additional supervisory signals to efficiently optimize our self-supervised 360° depth estimation model. The principal-direction normal constraint aligns the surface normals of the 360° image with the directions of the vanishing points, while the co-planar depth constraint fits the estimated depth of each pixel to its underlying 3D plane. Finally, a refined depth map is obtained for the 360° image. Experimental results on four publicly available datasets demonstrate that the proposed method outperforms current state-of-the-art depth estimation methods.
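The abstract does not give the exact formulations, but the spherically weighted loss presumably compensates for the fact that equirectangular pixels near the poles cover far less of the sphere than pixels near the equator. The following PyTorch sketch illustrates one common way to realize such a weighting for the reconstruction and smoothing terms, using a cos(latitude) area weight; the function names and the exact loss composition are illustrative assumptions, not the authors' implementation.

import math
import torch

def latitude_weights(height, width, device="cpu"):
    """Per-pixel area weights for an equirectangular (ERP) grid: rows near the
    poles cover far less solid angle than rows near the equator, so each row
    is weighted by cos(latitude)."""
    lat = (0.5 - (torch.arange(height, device=device) + 0.5) / height) * math.pi
    w = torch.cos(lat).clamp(min=0.0)                             # (H,)
    return w.view(1, 1, height, 1).expand(1, 1, height, width)    # (1,1,H,W)

def weighted_photometric_loss(target, synthesized, weights):
    """Spherically weighted view-reconstruction loss (L1 only; an SSIM term
    would normally be added), scaled per pixel by the ERP area weights."""
    l1 = (target - synthesized).abs().mean(dim=1, keepdim=True)   # (B,1,H,W)
    return (weights * l1).sum() / (weights.sum() * target.shape[0])

def weighted_smoothness_loss(depth, image, weights):
    """Edge-aware depth smoothness, again scaled by cos(latitude)."""
    d = depth / (depth.mean(dim=(2, 3), keepdim=True) + 1e-7)     # mean-normalised depth
    dx = (d[..., :, 1:] - d[..., :, :-1]).abs()
    dy = (d[..., 1:, :] - d[..., :-1, :]).abs()
    gray = image.mean(1, keepdim=True)
    wx = torch.exp(-(gray[..., :, 1:] - gray[..., :, :-1]).abs()) # down-weight image edges
    wy = torch.exp(-(gray[..., 1:, :] - gray[..., :-1, :]).abs())
    return (weights[..., :, 1:] * wx * dx).mean() + (weights[..., 1:, :] * wy * dy).mean()

Likewise, the co-planar depth constraint presumably back-projects each ERP pixel to a 3D point along its spherical ray and encourages the points of a segmented planar region to lie on a common plane. The sketch below assumes this interpretation; erp_to_points and coplanar_residual are hypothetical helpers, not the paper's API.

def erp_to_points(depth):
    """Back-project an ERP depth map (B,1,H,W) to 3D points (B,3,H,W): every
    pixel maps to a unit ray on the sphere given by its longitude/latitude,
    scaled by the predicted depth."""
    b, _, h, w = depth.shape
    device = depth.device
    lon = ((torch.arange(w, device=device) + 0.5) / w - 0.5) * 2.0 * math.pi
    lat = (0.5 - (torch.arange(h, device=device) + 0.5) / h) * math.pi
    lon, lat = torch.meshgrid(lon, lat, indexing="xy")            # both (H,W)
    rays = torch.stack((torch.cos(lat) * torch.sin(lon),          # x (right)
                        torch.sin(lat),                           # y (up)
                        torch.cos(lat) * torch.cos(lon)), dim=0)  # z (forward)
    return rays.unsqueeze(0) * depth                              # (B,3,H,W)

def coplanar_residual(points, region_mask):
    """Fit a least-squares plane to the back-projected points of one planar
    region (boolean mask of shape (B,H,W)) and return the mean point-to-plane
    distance, which can act as a co-planarity penalty."""
    p = points.permute(0, 2, 3, 1)[region_mask]                   # (N,3)
    centroid = p.mean(dim=0, keepdim=True)
    # Plane normal = right singular vector with the smallest singular value
    # of the centred point set.
    _, _, vh = torch.linalg.svd(p - centroid, full_matrices=False)
    normal = vh[-1]
    return ((p - centroid) @ normal).abs().mean()

In practice such a residual would be evaluated per detected planar region (e.g., from a plane or superpixel segmentation) and averaged into the overall training loss alongside the photometric, smoothness, and normal-alignment terms.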

