Abstract
We propose a multi-view stereo network based on multi-distribution fitting (MDF-Net), which achieves high-resolution depth map prediction with low memory consumption and high efficiency. The method adopts a four-stage cascade structure and makes the following three contributions. First, view cost regularization is proposed to weaken the influence of matching noise on building the cost volume. Second, the depth refinement interval is computed adaptively using multi-distribution fitting (MDF): Gaussian distribution fitting refines and corrects depth within a large interval, and Laplace distribution fitting then accurately estimates depth within a small interval. Third, a lightweight image super-resolution network upsamples the depth map in the fourth stage to reduce running time and memory requirements. Experimental results on the DTU dataset indicate that MDF-Net achieves state-of-the-art performance, with the lowest memory consumption and running time among high-resolution reconstruction methods, requiring only about 4.29 GB of memory to predict a depth map at a resolution of 1600 × 1184. In addition, we validate its generalization ability on the Tanks and Temples dataset, achieving very competitive performance. The code has been released at https://github.com/zongh5a/MDF-Net.
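The coarse-to-fine interval refinement described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `refine_interval` and the interval-width constant `k` are hypothetical, and only the Gaussian stage is shown; the Laplace stage would proceed analogously with an L1-based scale (mean absolute deviation) in place of the standard deviation.

```python
import numpy as np

def refine_interval(depth_hyps, probs, k=3.0):
    """One coarse-to-fine step, sketched with Gaussian fitting.

    depth_hyps: (D,) depth hypotheses for one pixel
    probs:      (D,) per-plane probabilities (sum to 1)

    Fit a Gaussian to the depth distribution (mean and standard
    deviation under the probability weights) and return a narrowed
    search interval [mu - k*sigma, mu + k*sigma] for the next stage.
    The constant k is a hypothetical choice, not the paper's rule.
    """
    mu = float((probs * depth_hyps).sum())
    sigma = float(np.sqrt((probs * (depth_hyps - mu) ** 2).sum()))
    return mu - k * sigma, mu + k * sigma
```

A sharply peaked probability volume yields a very narrow next-stage interval, while a diffuse one keeps the search range wide, which is the adaptive behavior the cascade relies on.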
Funding
This work was supported by the National Natural Science Foundation of China under Grant 61971339 and 61471161, the Natural Science Basic Research Program of Shaanxi under Grant 2023-JC-YB-826, the Scientific Research Program Funded by Shaanxi Provincial Education Department under Grant 22JP028, and the Postgraduate Innovation Fund of Xi'an Polytechnic University under Grant chx2022019.
Appendix
1.1 Why use softmax to preprocess feature groups?
Ideally, the more similar the features from different views on the same depth plane, the closer that plane is to the true depth and the higher its probability. We therefore believe a good cost metric should satisfy three conditions. First, it should measure similarity well. Second, the more similar the features on a depth plane, the larger the cost should be; that is, cost should be proportional to similarity. Third, the value range of the cost metric should match the probability range [0, 1]. We choose the inner product as the main cost metric (meeting the first condition). In addition, compared with vector normalization, the gradient of softmax normalization is simpler to compute, and it keeps the inner product within [0, 1]. Preprocessing the feature groups with the softmax function simplifies the fitting process, making VCR-Net and the 3D CNN more efficient.
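The three conditions above can be checked on a minimal sketch of the metric. This is an illustration, not the paper's code: the function names and the (G, C) group layout are assumptions. Softmax puts each feature group on the probability simplex, so the inner product of two groups is a sum of products of probabilities, which is guaranteed to lie in [0, 1] and grows as the groups agree.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: shift by the max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_cost(ref_feat, src_feat):
    """Inner product of softmax-normalized feature groups.

    ref_feat, src_feat: (G, C) arrays, i.e. G groups of C-dim
    features (a hypothetical layout for illustration). Each group
    is softmax-normalized, so the per-group inner product is a
    similarity score bounded in [0, 1].
    """
    p = softmax(ref_feat, axis=-1)
    q = softmax(src_feat, axis=-1)
    return (p * q).sum(axis=-1)  # (G,) per-group cost in [0, 1]
```

Matching a feature group against itself yields a larger cost than matching it against a permuted (dissimilar) group, consistent with the second condition.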
1.2 Why can VCR-Net improve the cost volume quality?
VCR-Net serves a similar function to the 3D CNN regularization network: both regularize the cost volume to obtain a probability for each depth plane. The difference is that VCR-Net processes the cost volume of each view separately, uses the resulting probability volume as a weight, and applies the sigmoid activation function. Features at noisy locations are mismatched and have low similarity, so VCR-Net assigns them a small weight; when the weighted average is computed, this low weight suppresses the matching cost of the noise. The VCR-Net network structure is shown in Table 7.
1.3 Visualization of point cloud results
All qualitative results of our method are shown in Figs. 6 and 7.
About this article
Cite this article
Chen, J., Yu, Z., Ma, L. et al. Multi-distribution fitting for multi-view stereo. Machine Vision and Applications 34, 93 (2023). https://doi.org/10.1007/s00138-023-01449-4