Three-Dimensional Reconstruction of Indoor Scenes Based on Implicit Neural Representation
Figure 1. (a) Distortion of the reconstructed 3D model under uneven lighting conditions, enclosed by the red dashed box; (b) distortion of the 3D reconstruction of smooth planar texture areas, enclosed by the red dashed box; (c) floating debris noise, enclosed by the red box, in the 3D reconstruction.
Figure 2. Overall framework of the method.
Figure 3. (a) The normal estimation based on the TiltedSN module is inaccurate for some fine structures enclosed by the red dashed box, such as chair legs; (b) we use an adaptive normal prior method to derive accurate normals based on the consistency of adjacent images, and the fine structures in the red dashed box are accurately reconstructed.
Figure 4. Neural implicit reconstruction process.
Figure 5. Distribution diagram of the distances and weight values between sampling points.
Figure 6. Three-dimensional models reconstructed from scenes in the ScanNet dataset. (a) Comparison of the 3D models; (b) comparison of the specific details in the red dashed box.
Figure 7. Qualitative comparison for thin-structure areas on the ScanNet dataset: (a) reference image; (b) model reconstructed without the normal prior; (c) model reconstructed with the normal prior but without the adaptive scheme; (d) model reconstructed with the normal prior and the adaptive scheme.
Figure 8. Qualitative comparison for reflective areas on the Hypersim dataset: (a) reference image; (b) model reconstructed without the normal prior; (c) model reconstructed with the normal prior but without the adaptive scheme; (d) model reconstructed with the normal prior and the adaptive scheme.
Figure 9. Visual comparison for a scene with a large amount of floating debris on the ScanNet dataset: (a) reconstruction result without the distortion loss function; (b) reconstruction result with the distortion loss function.
Figure 10. Visual comparison for a scene with a single floating debris area, enclosed by the red dashed box, on the ScanNet dataset: (a) reconstruction result without the distortion loss function; (b) reconstruction result with the distortion loss function.
Figure 11. Limitations of this method in the 3D reconstruction of scenes with clutter, occlusion, soft non-solid objects, and blurred images, on the ScanNet dataset.
Abstract
1. Introduction
- (1) It proposes a new indoor 3D reconstruction method that combines NeRF- and SDF-based scene representations, preserving the high-quality geometric information captured by the NeRF while using the SDF to generate an explicit mesh with a smooth surface.
- (2) By adding adaptive normal priors that provide globally consistent geometric constraints, the reconstruction quality of planar texture areas and fine details is significantly improved (see the sketch after this list).
- (3) By introducing a new regularization term, the uneven distribution of the NeRF density is alleviated and floating debris is removed from the final generated model, improving the visual quality of the results.
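A commonly used form of such a normal prior constraint combines an L1 term and an angular (cosine) term between the rendered normals and the prior normals, gated by a per-ray confidence mask for the adaptive scheme. The PyTorch sketch below illustrates this general form only; the function name, tensor shapes, and mask-based gating are illustrative assumptions rather than the exact loss used in this paper.

```python
import torch

def normal_prior_loss(pred_normals: torch.Tensor,
                      prior_normals: torch.Tensor,
                      mask: torch.Tensor) -> torch.Tensor:
    """L1 + angular consistency between rendered normals and prior normals.

    pred_normals, prior_normals: (N, 3) unit normals for N sampled rays.
    mask: (N,) values in [0, 1]; 1 where the prior is trusted, 0 where it is
    ignored (a stand-in for the adaptive scheme of Section 3.1).
    """
    l1 = (pred_normals - prior_normals).abs().sum(dim=-1)       # L1 difference per ray
    ang = 1.0 - (pred_normals * prior_normals).sum(dim=-1)      # 1 - cos(angle) per ray
    per_ray = l1 + ang
    return (mask * per_ray).sum() / mask.sum().clamp(min=1.0)   # masked mean over rays
```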
2. Related Works
2.1. Three-Dimensional Reconstruction Based on Visual SLAM
2.2. Three-Dimensional Reconstruction Based on TSDF
2.3. Three-Dimensional Reconstruction Based on MVS
2.4. Three-Dimensional Reconstruction Based on Implicit Neural Networks
3. Methodology
- Normal estimation module: This module uses a spatial rectifier-based method to generate the corresponding normal map for a single RGB image, and prepares data for the prior part of neural implicit reconstruction.
- NeRF module: The appearance decomposition and feature processing of the scene images are performed through the neural radiance field, yielding the volume density and color at each sample point. The image under the corresponding viewpoint is obtained by volume rendering, and the MLP parameters are optimized by back-propagating the loss against the input image (a rendering sketch follows this list).
- SDF field module: The purpose of this module is to learn a high-quality SDF from the network while strengthening the network’s understanding of the geometric structure through the normal prior. The implicit 3D representation, the SDF, is converted into an explicit triangular mesh through the Marching Cubes algorithm.
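The color produced by the NeRF module is accumulated along each ray with the standard volume rendering quadrature, in which the weight of each sample is the product of its accumulated transmittance and its opacity. The following PyTorch sketch shows this accumulation for a single ray; the function name and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch

def render_ray(sigma: torch.Tensor, color: torch.Tensor, t: torch.Tensor):
    """Standard volume rendering quadrature for one ray.

    sigma: (S,) densities at the S samples, color: (S, 3) radiance at the samples,
    t: (S,) sample depths sorted along the ray.
    Returns the rendered pixel color and the per-sample weights.
    """
    delta = torch.diff(t, append=t[-1:] + 1e10)                  # interval widths
    alpha = 1.0 - torch.exp(-sigma * delta)                      # opacity of each interval
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
    )                                                            # accumulated transmittance T_i
    weights = trans * alpha                                      # w_i = T_i * alpha_i
    rgb = (weights[:, None] * color).sum(dim=0)                  # accumulated pixel color
    return rgb, weights
```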
3.1. Optimization of Indoor 3D Reconstruction Based on Adaptive Normal Prior
3.2. Neural Implicit Reconstruction
3.2.1. Scene Representation
- (1) Clear surface definition: The SDF provides the distance to the nearest surface for each spatial point, where the surface is defined as the location where the SDF value is zero. This representation is well suited for extracting clear and precise surfaces, making the conversion from SDF to mesh relatively direct and efficient (see the sketch after this list).
- (2) Geometric accuracy: The SDF can accurately represent sharp edges and complex topological structures, which are maintained when converted to meshes, thereby generating high-quality 3D models.
- (3) Optimization-friendly: The information provided by the SDF can be directly used for geometry optimization and mesh smoothing operations, which helps to further improve the quality of the model when generating the mesh.
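As a concrete illustration of the SDF-to-mesh conversion, the following sketch evaluates a toy analytic SDF (a unit sphere) on a regular grid and extracts its zero level set with the Marching Cubes implementation in scikit-image. In the actual pipeline the grid values would instead come from querying the trained SDF network; the grid resolution and extent here are arbitrary assumptions.

```python
import numpy as np
from skimage import measure

# Evaluate a toy analytic SDF (unit sphere) on a regular grid. In the actual
# pipeline these values would come from querying the trained SDF network.
n = 64
xs = np.linspace(-1.5, 1.5, n)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)   # (n, n, n, 3) points
sdf = np.linalg.norm(grid, axis=-1) - 1.0                          # signed distance to the sphere

# Marching Cubes extracts the zero level set (the surface) as a triangle mesh.
verts, faces, normals, _ = measure.marching_cubes(
    sdf, level=0.0, spacing=(xs[1] - xs[0],) * 3
)
print(verts.shape, faces.shape)   # mesh vertices and triangle indices
```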
3.2.2. Implicit Indoor 3D Reconstruction Based on Normal Prior
3.3. Floating Debris Removal
- (1) If the distance between point u and point v is relatively large, then to keep the corresponding product of the two weight values and the distance as small as possible, at least one of the two weights needs to be small and close to zero. As shown in Figure 5, the pairs (A, B), (B, D), (A, D), etc. all satisfy this case: the distance is large and the weight value of at least one point is extremely small (close to zero);
- (2) If the weight values at points u and v are both large, then to keep the product as small as possible, the distance between points u and v needs to be small, that is, the two points are very close to each other. As shown in Figure 5, only the pair (B, C) satisfies the condition that both weight values are large, and in this case points B and C are indeed very close (a sketch of the corresponding loss follows the list below).
- Minimize the width of each interval;
- Bring the intervals that are far apart closer to each other;
- Make the weight distribution more concentrated.
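The regularization term described above behaves like the distortion loss of Mip-NeRF 360: a pairwise term that pulls high-weight intervals together and a self term that shrinks each interval. The following PyTorch sketch shows that loss for a single ray under this assumption; the function name and tensor shapes are illustrative.

```python
import torch

def distortion_loss(weights: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Distortion regularizer in the style of Mip-NeRF 360, for a single ray.

    weights: (S,) volume rendering weights of the S intervals along the ray.
    t: (S + 1,) interval boundaries along the ray.
    """
    mids = 0.5 * (t[1:] + t[:-1])          # interval midpoints
    widths = t[1:] - t[:-1]                # interval widths
    # Pairwise term: large when two intervals both carry large weights and are
    # far apart, matching cases (1) and (2) discussed above.
    pair = (weights[:, None] * weights[None, :]
            * (mids[:, None] - mids[None, :]).abs()).sum()
    # Self term: shrinks the width of each individual interval.
    self_term = (weights ** 2 * widths).sum() / 3.0
    return pair + self_term
```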
3.4. Training and Loss Function
4. Experimentation
4.1. Dataset
4.2. Comparative Experiment
4.3. Ablation Experiment
4.3.1. Normal Geometry Prior Ablation Experiment
4.3.2. Ablation Experiment of Distortion Loss Function
4.4. Limitations
5. Conclusion and Future Work
- When the scene contains many irregular objects and elements, this method can only reconstruct their general outlines, and the object categories cannot be directly determined from these outlines. In addition, the reconstructed results are discontinuous at the connections between objects and between objects and the background. One solution is to introduce more priors so that the neural network obtains more useful information and can accurately learn and understand the elements in the scene. Another feasible solution is to distinguish object areas from non-object areas and learn them separately, which would help capture more complex details.
- This method takes from several hours to more than ten hours to train and optimize a single indoor scene, which limits its application to large-scale and real-time reconstruction. One possible way to improve training efficiency is to use hashed multi-resolution encoding, which allows a smaller network to be used without sacrificing quality. This significantly reduces the number of floating-point and memory-access operations, so the network can be trained at a smaller computational cost while maintaining reconstruction quality, greatly reducing training time.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kang, Z.; Yang, J.; Yang, Z.; Cheng, S. A review of techniques for 3d reconstruction of indoor environments. ISPRS Int. J. Geo-Inf. 2020, 9, 330. [Google Scholar] [CrossRef]
- Li, J.; Gao, W.; Wu, Y.; Liu, Y.; Shen, Y. High-quality indoor scene 3d reconstruction with rgb-d cameras: A brief review. Comput. Vis. Media 2022, 8, 369–393. [Google Scholar] [CrossRef]
- Hess, W.; Kohler, D.; Rapp, H.; Andor, D. Real-Time Loop Closure in 2d Lidar Slam. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 1271–1278. [Google Scholar]
- Zhang, J.; Singh, S. Loam: Lidar odometry and mapping in real-time. In Robotics: Science and Systems; University of California: Berkeley, CA, USA, 2014; Volume 2, pp. 1–9. [Google Scholar]
- Shan, T.; Englot, B. Lego-loam: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 4758–4765. [Google Scholar]
- Curless, B.; Levoy, M. A Volumetric Method for Building Complex Models from Range Images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 4–9 August 1996; pp. 303–312. [Google Scholar]
- Newcombe, R.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. Kinectfusion: Real-Time Dense Surface Mapping and Tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011; pp. 127–136. [Google Scholar]
- Murez, Z.; Van As, T.; Bartolozzi, J.; Sinha, A.; Badrinarayanan, V.; Rabinovich, A. Atlas: End-to-end 3d scene reconstruction from posed images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 414–431. [Google Scholar]
- Sun, J.; Xie, Y.; Chen, L.; Zhou, X.; Bao, H. Neuralrecon: Real-Time Coherent 3d Reconstruction from Monocular Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15598–15607. [Google Scholar]
- Schonberger, J.L.; Frahm, J.M. Structure-from-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4104–4113. [Google Scholar]
- Schönberger, J.L.; Zheng, E.; Frahm, J.M.; Pollefeys, M. Pixelwise View Selection for Unstructured Multi-View Stereo. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part III 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 501–518. [Google Scholar]
- Xu, Q.; Tao, W. Planar prior assisted patchmatch multi-view stereo. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12516–12523. [Google Scholar] [CrossRef]
- Im, S.; Jeon, H.G.; Lin, S.; Kweon, I.S. Dpsnet: End-to-end deep plane sweep stereo. arXiv 2019, arXiv:1905.00538. [Google Scholar]
- Wang, F.; Galliani, S.; Vogel, C.; Speciale, P.; Pollefeys, M. Patchmatchnet: Learned Multi-View Patchmatch Stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14194–14203. [Google Scholar]
- Xu, Q.; Tao, W. Pvsnet: Pixelwise visibility-aware multi-view stereo network. arXiv 2020, arXiv:2007.07714. [Google Scholar]
- Yao, Y.; Luo, Z.; Li, S.; Fang, T.; Quan, L. Mvsnet: Depth Inference for Unstructured Multi-View Stereo. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 767–783. [Google Scholar]
- Yu, Z.; Gao, S. Fast-Mvsnet: Sparse-to-Dense Multi-View Stereo with Learned Propagation and Gauss-Newton Refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1949–1958. [Google Scholar]
- Teed, Z.; Deng, J. Deepv2d: Video to depth with differentiable structure from motion. arXiv 2018, arXiv:1812.04605. [Google Scholar]
- Huang, P.H.; Matzen, K.; Kopf, J.; Ahuja, N.; Huang, J.B. Deepmvs: Learning Multi-View Stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2821–2830. [Google Scholar]
- Cheng, S.; Xu, Z.; Zhu, S.; Li, Z.; Li, L.E.; Ramamoorthi, R.; Su, H. Deep Stereo using Adaptive thin Volume Representation with Uncertainty Awareness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2524–2534. [Google Scholar]
- Yang, H.; Chen, R.; An, S.P.; Wei, H.; Zhang, H. The growth of image-related three dimensional reconstruction techniques in deep learning-driven era: A critical summary. J. Image Graph. 2023, 28, 2396–2409. [Google Scholar]
- Liu, S.; Zhang, Y.; Peng, S.; Shi, B.; Pollefeys, M.; Cui, Z. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2019–2028. [Google Scholar]
- Yariv, L.; Kasten, Y.; Moran, D.; Galun, M.; Atzmon, M.; Ronen, B.; Lipman, Y. Multiview neural surface reconstruction by disentangling geometry and appearance. Adv. Neural Inf. Process. Syst. 2020, 33, 2492–2502. [Google Scholar]
- Wang, P.; Liu, L.; Liu, Y.; Theobalt, C.; Komura, T.; Wang, W. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv 2021, arXiv:2106.10689. [Google Scholar]
- Yariv, L.; Gu, J.; Kasten, Y.; Lipman, Y. Volume rendering of neural implicit surfaces. Adv. Neural Inf. Process. Syst. 2021, 34, 4805–4815. [Google Scholar]
- Barron, J.T.; Mildenhall, B.; Verbin, D.; Srinivasan, P.P.; Hedman, P. Mip-nerf 360: Unbounded Anti-Aliased Neural Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
- Liu, L.; Gu, J.; Zaw Lin, K.; Chua, T.S.; Theobalt, C. Neural sparse voxel fields. Adv. Neural Inf. Process. Syst. 2020, 33, 15651–15663. [Google Scholar]
- Rebain, D.; Matthews, M.; Yi, K.M.; Lagun, D.; Tagliasacchi, A. Lolnerf: Learn from one look. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
- Gafni, G.; Thies, J.; Zollhofer, M.; Nießner, M. Dynamic Neural Radiance Fields for Monocular 4d Facial Avatar Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Do, T.; Vuong, K.; Roumeliotis, S.I.; Park, H.S. Surface Normal Estimation of Tilted Images Via Spatial Rectifier. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16; Springer International Publishing: New York, NY, USA, 2020; pp. 265–280. [Google Scholar]
- Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-Annotated 3d Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
- Roberts, M.; Ramapuram, J.; Ranjan, A.; Kumar, A.; Bautista, M.A.; Paczan, N.; Webb, R.; Susskind, J.M. Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10912–10922. [Google Scholar]
- Straub, J.; Whelan, T.; Ma, L.; Chen, Y.; Wijmans, E.; Green, S.; Engel, J.J.; Mur-Artal, R.; Ren, C.; Verma, S.; et al. The replica dataset: A digital replica of indoor spaces. arXiv 2019, arXiv:1906.05797. [Google Scholar]
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
- Guo, H.; Peng, S.; Lin, H.; Wang, Q.; Zhang, G.; Bao, H.; Zhou, X. Neural 3d scene reconstruction with the Manhattan-world assumption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5511–5520. [Google Scholar]
- Yu, Z.; Peng, S.; Niemeyer, M.; Sattler, T.; Geiger, A. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. Adv. Neural Inf. Process. Syst. 2022, 35, 25018–25032. [Google Scholar]
- Zhu, J.; Huo, Y.; Ye, Q.; Luan, F.; Li, J.; Xi, D.; Wang, L.; Tang, R.; Hua, W.; Bao, H.; et al. I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12489–12498. [Google Scholar]
Dataset | Scene Number | Scenes Selected in This Paper
---|---|---
ScanNet | 1500+ | 10 |
Hypersim | 461 | 10 |
Replica | 18 | 10 |
Method | Acc ↓ | Comp ↓ | Prec ↑ | Recall ↑ | F-Score ↑
---|---|---|---|---|---
COLMAP [10] | 0.047 | 0.235 | 0.711 | 0.441 | 0.537
Atlas [8] | 0.211 | 0.070 | 0.500 | 0.659 | 0.564
NeuralRecon [11] | 0.056 | 0.081 | 0.545 | 0.604 | 0.572
NeuS [24] | 0.179 | 0.208 | 0.313 | 0.275 | 0.291
VolSDF [25] | 0.414 | 0.120 | 0.321 | 0.394 | 0.346
NeRF [34] | 0.735 | 0.177 | 0.131 | 0.290 | 0.176
Manhattan-SDF [35] | 0.072 | 0.068 | 0.621 | 0.586 | 0.602
MonoSDF [36] | 0.035 | 0.048 | 0.799 | 0.681 | 0.733
I2-SDF [37] | 0.066 | 0.070 | 0.605 | 0.575 | 0.590
Ours | 0.037 | 0.048 | 0.801 | 0.702 | 0.748
Method | Acc ↓ | Comp ↓ | Prec ↑ | Recall ↑ | F-Score ↑
---|---|---|---|---|---
w/o N | 0.183 | 0.152 | 0.286 | 0.290 | 0.284
w/N, w/o A | 0.050 | 0.053 | 0.759 | 0.699 | 0.727
Ours | 0.037 | 0.048 | 0.805 | 0.709 | 0.753
Method | Acc ↓ | Comp ↓ | Prec ↑ | Recall ↑ | F-Score ↑
---|---|---|---|---|---
w/o distortion loss | 0.055 | 0.052 | 0.742 | 0.701 | 0.721
Ours | 0.047 | 0.052 | 0.795 | 0.704 | 0.746
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).