CN115170746A

CN115170746A - Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Info

Publication number: CN115170746A
Application number: CN202211087276.9A
Authority: CN
Inventors: 任胜兵; 彭泽文; 陈旭洋
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2022-09-07
Filing date: 2022-09-07
Publication date: 2022-10-11
Anticipated expiration: 2042-09-07
Also published as: CN115170746B

Abstract

The invention discloses a method, a system and equipment for multi-view three-dimensional reconstruction based on deep learning, wherein a plurality of multi-view images are obtained, multi-scale semantic feature extraction is carried out on the multi-view images, and feature maps of various scales are obtained; performing multi-scale semantic segmentation on the feature maps of various scales to obtain semantic segmentation sets of various scales; reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map; obtaining depth maps of various scales based on the semantic segmentation sets and the initial depth maps of various scales; constructing point cloud sets with various scales; optimizing the point cloud sets of various scales by adopting different radius filtering to obtain optimized point cloud sets; reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales; and splicing and fusing the three-dimensional reconstruction results of each scale. The invention can fully utilize semantic information of each scale and improve the accuracy of three-dimensional reconstruction.

Description

Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Technical Field

The invention relates to the technical field of computer vision, in particular to a method, a system and equipment for multi-view three-dimensional reconstruction based on deep learning.

Background

The three-dimensional reconstruction method for deep learning is that a neural network is built by using a computer, training is carried out through a large amount of image data and three-dimensional model data, and the mapping relation from an image to a three-dimensional model is learned, so that the three-dimensional reconstruction of a new image target is realized. Compared with the traditional method such as a 3D digital media management (3D) method and a Structural From Motion (SFM), the three-dimensional reconstruction method for deep learning can introduce some learned global semantic information into image reconstruction, so that the limitation that the traditional reconstruction method is poor in reconstruction of weak illumination and weak texture areas is overcome to a certain extent.

The existing deep learning three-dimensional reconstruction method is mostly based on a single scale, namely, objects with different sizes in an image are reconstructed in the same way. The single-scale reconstruction can keep better reconstruction accuracy and speed under the environment with lower scene complexity and fewer fine objects. However, the problem of insufficient reconstruction accuracy of small-scale objects easily occurs in some environments with complex scenes and more objects of various scales. And only the high-level features are utilized, and the low-level detail information of the image is not fully utilized.

Disclosure of Invention

The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a multi-view three-dimensional reconstruction method, a multi-view three-dimensional reconstruction system and multi-view three-dimensional reconstruction equipment based on deep learning, which can make full use of semantic information of each scale and improve the accuracy of three-dimensional reconstruction.

In a first aspect, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction method, where the deep learning-based multi-view three-dimensional reconstruction method includes:

acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of various scales;

performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;

reconstructing the multiple multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;

obtaining depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;

constructing a point cloud set with multiple scales based on the depth maps with multiple scales;

according to the scale of the point cloud set, different radius filtering is adopted for the point cloud sets with various scales to carry out optimization, and the optimized point cloud set is obtained;

reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales;

and splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.

Compared with the prior art, the first aspect of the invention has the following beneficial effects:

the method can extract the features of different scales by extracting the multi-scale semantic features of a plurality of multi-view images, can obtain the feature maps of various scales, can perform multi-scale semantic segmentation on the feature maps of various scales, and can aggregate the semantic information of each scale, thereby enriching the semantic information of each scale; semantic guidance is respectively carried out on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and the accurate depth map of multiple scales is obtained; the method comprises the steps of constructing a point cloud set with various scales by using the obtained depth maps with various scales, optimizing by adopting different radius filtering according to the scales of the point cloud set, using the optimized point cloud set for reconstruction with different scales, and fusing three-dimensional reconstruction results to obtain more accurate three-dimensional reconstruction results. Therefore, the method can fully utilize semantic information of each scale and improve the accuracy of three-dimensional reconstruction.

According to some embodiments of the present invention, the performing multi-scale semantic feature extraction on a plurality of the multi-view images to obtain feature maps of multiple scales includes:

performing multilayer feature extraction on the multiple multi-view images through a ResNet network to obtain original feature maps with multiple scales;

and respectively connecting the original feature map of each scale with channel attention so as to carry out importance weighting on the original feature map of each scale through a channel attention mechanism and obtain feature maps of various scales.

According to some embodiments of the present invention, the importance weighting is performed on the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales, including:

compressing the original characteristic diagram of each scale through a compression network to obtain a one-dimensional characteristic diagram corresponding to the original characteristic diagram of each scale;

inputting the one-dimensional characteristic diagram into a full-connection layer through an excitation network to perform importance prediction, and obtaining the importance of each channel;

and exciting the importance of each channel to the one-dimensional characteristic diagram of the original characteristic diagram of each scale through an excitation function to obtain characteristic diagrams of various scales.

According to some embodiments of the present invention, the performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain a semantic segmentation set of multiple scales includes:

clustering the characteristic graphs of multiple scales through nonnegative matrix decomposition to obtain semantic segmentation sets of multiple scales; wherein the expression of the non-negative matrix factorization is:

the method comprises the following steps that V represents a matrix V which is formed by mapping, connecting and reshaping feature maps of various scales into HW rows and C columns, P represents a matrix of the HW rows and K columns, Q represents a matrix of the K rows and C columns, H represents a coefficient matrix, W represents a base matrix, K represents a non-negative matrix decomposition factor of semantic cluster number, C represents the dimension of each pixel, and F represents the adoption of a non-induced norm.

According to some embodiments of the present invention, the obtaining the depth maps of the plurality of scales based on the semantic segmentation sets of the plurality of scales and the initial depth map comprises:

selecting any one of the multiple multi-view images as a reference image, and taking the other images as images to be matched;

selecting a reference point from the reference image, acquiring a semantic category corresponding to the reference point in the semantic segmentation set, and acquiring a depth value corresponding to the reference point on the initial depth image;

the number of reference points is chosen by the following formula:

wherein,

representing the number of reference points selected by the jth segmentation set, H representing the height of the multi-view image, W representing the width of the multi-view image, HW representing the number of pixel points of the multi-view image, t representing a constant parameter,

representing the number of semantic categories contained in the jth of said semantic segmentation sets,

representing the number of semantic categories contained in the ith semantic segmentation set;

based on each reference point, obtaining the matching point of each reference point on the graph to be matched through the following formula:

wherein,

representing the matching point of the ith reference point on the graph to be matched, K representing the internal reference of the camera, T representing the external reference of the camera,

representing a reference point P in said reference map _i Corresponding depth values on the initial depth map;

obtaining the semantic category corresponding to each matching point, correcting the multi-view image of each scale by minimizing a semantic loss function to obtain the depth maps of multiple scales, wherein the semantic loss function

The calculation formula of (a) is as follows:

wherein,

representing the difference between the semantic information of the ith reference point and the semantic information of the ith matching point, M _i Representing a mask and N representing the number of said reference points.

According to some embodiments of the invention, the constructing a point cloud set of multiple scales based on the depth maps of multiple scales comprises:

constructing a point cloud set of each scale by using the depth map of each scale according to the following expression:

wherein,

the abscissa representing the depth map is shown as,

represents the ordinate of the depth map and,

and

representing the camera focal length obtained from the camera parameters, and x, y and z represent the point cloud coordinates of the point cloud transformation.

According to some embodiments of the present invention, the optimizing the point cloud sets of multiple scales by using different radius filtering according to the scales of the point cloud sets to obtain an optimized point cloud set includes:

acquiring the point cloud sets of multiple scales, wherein the point cloud in the point cloud set of each scale has a corresponding radius and a preset number of adjacent points;

calculating the corresponding radius of the point cloud in the point cloud set by adopting the following formula according to the scale of the point cloud set:

wherein,

representing the corresponding radius of the point cloud in the point cloud set with different scales,

representing a constant parameter, t representing a constant parameter,

representing a preset scale grade of each point cloud set;

and optimizing the point cloud sets with various scales according to the radius corresponding to each point cloud and the preset number of adjacent points to obtain an optimized point cloud set.

In a second aspect, an embodiment of the present invention further provides a deep learning-based multi-view three-dimensional reconstruction system, where the deep learning-based multi-view three-dimensional reconstruction system includes:

the characteristic diagram acquisition unit is used for acquiring multi-view images, and performing multi-scale semantic feature extraction on the multi-view images to acquire characteristic diagrams of multiple scales;

the semantic segmentation set acquisition unit is used for carrying out multi-scale semantic segmentation on the feature maps with various scales to acquire a semantic segmentation set with various scales;

the initial depth map acquisition unit is used for reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;

the depth map acquisition unit is used for acquiring depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;

the point cloud set acquisition unit is used for constructing point cloud sets of multiple scales on the basis of the depth maps of the multiple scales;

the radius filtering unit is used for optimizing the point cloud sets with various scales by adopting different radius filtering according to the scales of the point cloud sets to obtain the optimized point cloud sets;

a reconstruction result obtaining unit, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;

and the reconstruction result fusion unit is used for splicing and fusing the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.

Compared with the prior art, the second aspect of the invention has the following beneficial effects:

the feature map acquisition unit of the system can extract deep features by performing multi-scale semantic feature extraction on a plurality of multi-view images, can acquire feature maps of various scales, performs multi-scale semantic segmentation on the feature maps of various scales by the semantic segmentation set acquisition unit, aggregates semantic information of various scales, and enriches the semantic information of various scales; the depth map acquisition unit is used for respectively carrying out semantic guidance on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and the accurate depth map of multiple scales is obtained; a point cloud set acquisition unit of the system constructs a point cloud set with multiple scales by using the acquired depth maps with multiple scales, different radius filtering is adopted for optimization according to the scales of the point cloud set through a radius filtering unit, reconstruction with different scales is carried out on the basis of the optimized point cloud set through a reconstruction result acquisition unit, and then a three-dimensional reconstruction result is fused through a reconstruction result fusion unit to obtain a more accurate three-dimensional reconstruction result. Therefore, the system can make full use of semantic information of each scale and improve the accuracy of three-dimensional reconstruction.

In a third aspect, an embodiment of the present invention further provides a deep learning-based multi-view three-dimensional reconstruction apparatus, including at least one control processor and a memory, which is in communication connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform a method of deep learning based multi-view three-dimensional reconstruction as described above.

In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to enable a computer to execute a method for deep learning based multi-view three-dimensional reconstruction as described above.

It is to be understood that the advantageous effects of the third aspect to the fourth aspect compared to the related art are the same as the advantageous effects of the first aspect compared to the related art, and reference may be made to the related description in the first aspect, which is not repeated herein.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

fig. 1 is a flowchart of a deep learning-based multi-view three-dimensional reconstruction method according to an embodiment of the present invention;

FIG. 2 is a block diagram of a depth residual network in accordance with one embodiment of the present invention;

FIG. 3 is a schematic diagram of a non-negative matrix factorization of an embodiment of the present invention;

FIG. 4 is a block diagram of multi-scale semantic segmentation in accordance with an embodiment of the present invention;

fig. 5 is a structural diagram of a deep learning-based multi-view three-dimensional reconstruction system according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

In the description of the present invention, if there are first, second, etc. described, it is only for the purpose of distinguishing technical features, and it is not understood that relative importance is indicated or implied or that the number of indicated technical features is implicitly indicated or that the precedence of the indicated technical features is implicitly indicated.

In the description of the present invention, it should be understood that the orientation or positional relationship referred to, for example, the upper, lower, etc., is indicated based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, but does not indicate or imply that the device or element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention.

In the description of the present invention, it should be noted that unless otherwise explicitly defined, terms such as setup, installation, connection, etc. should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention by combining the detailed contents of the technical solutions.

For the convenience of understanding of those skilled in the art, the terms in the present embodiment are explained:

the deep learning three-dimensional reconstruction method comprises the following steps: the three-dimensional reconstruction method for deep learning is that a neural network is built by a computer, training is carried out through a large amount of image data and three-dimensional model data, and the mapping relation between an image and a three-dimensional model is learned, so that three-dimensional reconstruction of a new image target is realized. Compared with the traditional method for reconstructing three-dimensional information such as 3DMM and the method for reconstructing three-dimensional information by SFM, the three-dimensional reconstruction method for deep learning can introduce some global semantic information into image reconstruction, thereby overcoming the limitation that the traditional reconstruction method is poor in reconstruction in weak illumination and weak texture areas to a certain extent, wherein the SFM algorithm is an off-line algorithm for three-dimensional reconstruction based on various collected disordered pictures; the 3DMM, a three-dimensional deformable face model, is a general three-dimensional face model, and represents a face by using fixed points.

The current three-dimensional reconstruction methods for deep learning can be mainly classified into supervised three-dimensional reconstruction methods (for example, NVSNet, CVP-MVSNet, patchmatchchnet and the like in the prior art) and self-supervised three-dimensional reconstruction methods (for example, JDACS-MS and the like in the prior art). The supervised three-dimensional reconstruction method needs truth values for training, has high precision, and is difficult to apply in some scenes in which truth values are difficult to acquire. The self-supervision three-dimensional reconstruction method does not need real value training, and has wide application range and relatively low precision.

Semantic segmentation: semantic segmentation is a classification at the pixel level, and pixels belonging to the same class are classified into one class, so that semantic segmentation is used for understanding an image from the pixel level, for example, pixels having different semantics are marked with different colors. Pixels belonging to animals are classified into the same class. The segmented semantic information can guide image reconstruction, and reconstruction accuracy is improved. And performing semantic segmentation by adopting a clustering mode, and clustering pixels belonging to the same class into the same class.

Depth map: a distance image is an image in which the distance (depth) value from an image capture device to each point in a scene is defined as a pixel value.

Point cloud: the point data set of the object appearance surface is point cloud, contains information such as three-dimensional coordinate information and color of the object, and can realize image reconstruction through the point cloud data.

non-Negative Matrix Factorization (NMF): is a matrix decomposition method under the constraint that all elements in the matrix are non-negative. There are many analysis methods for solving the practical problem by matrix decomposition, such as PCA (principal component analysis), ICA (independent component analysis), SVD (singular value decomposition), VQ (vector quantization), and the like. In all these methods, the original large matrix V is approximately decomposed into a low rank V = WH form. The common feature of these methods is that the elements in the factors W and H can be positive or negative, and even if the input initial matrix elements are all positive, the non-negativity of the original data cannot be guaranteed by the conventional rank reduction algorithm. Mathematically, it is true from a computational point of view that the presence of negative values in the decomposition results is correct, but negative values elements often make no sense in practical problems.

The three-dimensional reconstruction method for deep learning is that a neural network is built by using a computer, training is carried out through a large amount of image data and three-dimensional model data, and the mapping relation from an image to a three-dimensional model is learned, so that the three-dimensional reconstruction of a new image target is realized. Compared with the traditional method such as a 3DMM method and an SFM method, the three-dimensional reconstruction method based on deep learning can introduce some learned global semantic information into image reconstruction, so that the limitation that the traditional reconstruction method is poor in reconstruction in weak-illumination and weak-texture areas is overcome to a certain extent.

The existing deep learning three-dimensional reconstruction method is mostly based on a single scale, namely, the objects with different sizes in the image are reconstructed in the same way. The single-scale reconstruction can keep better reconstruction accuracy and speed under the environment with lower scene complexity and fewer fine objects. However, the problem of insufficient reconstruction accuracy of small-scale objects easily occurs in some environments with complex scenes and more objects of various scales. And only the high-level features are utilized, and the low-level detail information of the image is not fully utilized.

In order to solve the problems, the multi-scale semantic feature extraction is carried out on a plurality of multi-view images, features of different scales can be extracted, feature maps of various scales can be obtained, multi-scale semantic segmentation is carried out on the feature maps of various scales, semantic information of various scales is aggregated, and the semantic information of various scales is enriched; semantic guidance is respectively carried out on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and an accurate depth map of multiple scales is obtained; the method and the device construct the point cloud sets of various scales by using the obtained depth maps of various scales, optimize by adopting different radius filtering according to the scales of the point cloud sets, use the optimized point cloud sets for reconstruction of different scales, and fuse three-dimensional reconstruction results to obtain more accurate three-dimensional reconstruction results. Therefore, the method and the device can make full use of semantic information of all scales, and can improve the accuracy of three-dimensional reconstruction.

Referring to fig. 1, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction method, where the deep learning-based multi-view three-dimensional reconstruction method includes:

s100, acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of various scales.

Specifically, a plurality of multi-view images are acquired, and the object to be recognized can be subjected to image acquisition at various angles in all directions through image acquisition equipment such as a camera and an image scanner, so that the plurality of multi-view images are obtained. For example, when multi-scale semantic feature extraction needs to be performed on multiple multi-view images, multiple multi-view images can be obtained by using an image acquisition device such as a camera.

In the embodiment, multilayer feature extraction is performed on a plurality of multi-view images through a ResNet network to obtain original feature maps with various scales;

respectively connecting the original feature map of each scale with channel attention, and performing importance weighting on the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales, specifically:

inputting the one-dimensional characteristic diagram into a full connection layer through an excitation network to carry out importance prediction, and obtaining the importance of each channel;

In the embodiment, the ResNet network is adopted to extract the image characteristics, the deeper the number of layers of the deep learning network is, the stronger the expression capability theoretically is, but after the CNN network reaches a certain depth, the classification performance is deepened, so that the network convergence is slower, and the accuracy is reduced; even if the data set is enlarged to solve the problem of overfitting, the classification performance and accuracy will not be improved. The ResNet network adopts a residual error learning method, and with reference to FIG. 2, the learned characteristics are recorded as x when the input is x

Now we want it to learn the residual

Such that the actual original learning features are

. This is so because residual learning is easier than direct learning of the original features. When the residual error is 0, the accumulation layer only performs identity mapping at this time, at least the network performance is not reduced, and actually the residual error is not 0, so that the accumulation layer can learn new features on the basis of the input features, and has better performance. The residual function is easier to optimize, and the number of network layers can be greatly deepened, so that deeper semantic information can be extracted. The performance of ResNet in the aspects of efficiency, resource consumption and deep semantic feature extraction is obviously superior to that of networks such as VGG (virtual grid generator) and the like.

After multi-layer feature extraction is carried out on a plurality of multi-view images through a ResNet network to obtain original feature maps of various scales, the original feature maps of each scale are respectively connected with channel attention, and importance weighting is carried out on the original feature maps of each scale through a channel attention mechanism to obtain the feature maps of various scales. The channel attention mechanism mainly comprises a compression network and an excitation network, and comprises the following specific processes:

let the dimension of the original feature map be H × W × C, where H is Height (Height), W is width (width), and C is channel number (channel). The compression network does the same by compressing H x W C to 1 x 1C, which is equivalent to compressing H x W to one-dimensional features, by global averaging pooling. After H W is compressed into one dimension, the corresponding one-dimensional parameters obtain the previous H W global view, and the sensing area is wider. And transmitting the one-dimensional characteristics obtained by the compression network to an excitation network, transmitting the one-dimensional characteristics to a full connection layer by the excitation network, predicting the importance of each channel to obtain the importance of different channels, and exciting the importance of different channels to the channels corresponding to the previous characteristic diagrams by a Sigmoid excitation function. The channel attention mechanism enables the network to pay attention to more effective semantic features, the weight of the semantic features is improved in an iterative mode, the feature extraction network extracts rich semantic features, and the importance of different semantic features to semantic segmentation is different. The introduction of the channel attention mechanism can enable the network to pay attention to more effective features, inhibit inefficient features and improve the effectiveness of feature extraction.

In this embodiment, because of the convolutional neural network used in feature extraction in the prior art, feature extraction like VGG network is limited by the number of network extraction layers, the deep level feature extraction capability is insufficient, and the feature validity is not high. With the increase of the number of the convolution layers, the problems of slow network convergence, low accuracy and the like occur, the feature extraction capability is insufficient, all the extracted features have different importance for image reconstruction, and the extraction of the features with high effectiveness is difficult to guarantee. Therefore, in the embodiment, by performing multi-scale semantic feature extraction on a plurality of multi-view images, deep features can be extracted, and feature maps of various scales can be obtained. And through the introduction of a channel attention mechanism, the network can pay attention to more effective features, the inefficient features are restrained, and the feature extraction effectiveness is improved.

And S200, performing multi-scale semantic segmentation on the feature maps of various scales to obtain a semantic segmentation set of various scales.

Specifically, clustering is carried out on feature maps of multiple scales through nonnegative matrix decomposition, and semantic segmentation sets of multiple scales are obtained; wherein the expression of the non-negative matrix factorization is as follows:

A typical matrix decomposition decomposes a large matrix into many smaller matrices, but the elements of these matrices have positive and negative values. In the real world, the existence of negative numbers in a matrix formed by images, texts and the like is meaningless, so that it is meaningful to decompose a matrix into all non-negative elements. Requiring a raw matrix in NMF

Is non-negative, then the matrix

Can be decomposed into the product of two smaller non-negative matrices with one and only one such decomposition satisfying presence and uniqueness. For example,

given matrix

Looking for non-negative matrices

And a non-negative matrix

So that

. Before and after decomposition it is understood that: original matrix

The column vector of (1) is the weighted sum of all the column vectors in the left matrix, and the weighting coefficient is the element of the corresponding column vector of the right matrix, so called

As a basis matrix, the matrix is,

is a matrix of coefficients.

Referring to fig. 3, N multi-scale feature maps are first concatenated and reshaped into a (HW, C) matrix V. Solving NMF using multiplicative update rules, i.e. using formulae

，

Solving for NMF, V is decomposed by NMF decomposition (i.e., NMF non-negative matrix decomposition) in the graph into (HW, K) matrix P and (K, C) matrix Q, where K is the NMF factor representing the number of semantic clusters. Due to NMF (QQ) ^T = I), each row of the (K, C) matrix Q may be considered as a C-dimensional cluster center, each row of the K, C) matrix Q corresponding to several objects in the view. The rows of the (HW, K) matrix P correspond to the positions of all pixels from the N multi-scale feature maps. In general, matrix decomposition forces the product between each row of P and each column of Q to better approximate the C-dimensional characteristics of each pixel in V. Thus, the semantic category of each position in the image is obtained by the P matrix.

Referring to FIG. 4, assume an extracted feature map

Each feature is semantically segmented by means of clustering (i.e., NMF non-negative matrix factorization)Matrix array

Is decomposed into

And because the receptive field of the high-level feature layer is large, the features are more abstract, and the global situation is more concerned. The lower characteristic layer has small receptive field and focuses more on details. Thus, each segmented set obtained by multi-scale semantic segmentation

The system comprises a plurality of layers from coarse to fine. The segmentation sets S1 to S3 in fig. 4 contain increasingly more detailed information. Each segmentation set S contains semantic segmentation results of an input set of images (reference image and image to be matched), for example, different colors represent different semantic categories, and a segmentation set (e.g., segmentation set S3) containing more detailed information contains more semantic categories.

In the embodiment, because most of the current deep learning three-dimensional reconstruction methods are based on a single scale, the three-dimensional reconstruction methods are reconstructed in the same manner for objects with different sizes in the image. The single-scale reconstruction can keep better reconstruction accuracy and speed in the environment with lower scene complexity and fewer small objects, but the problem of insufficient reconstruction accuracy of small-scale objects easily occurs in the environment with complex scenes and more objects of various scales; and only the high-level features are utilized, and the low-level detail information of the image is not fully utilized. Therefore, in the embodiment, the feature maps of multiple scales are subjected to multi-scale semantic segmentation, semantic information of each scale is aggregated, the semantic information of each scale is enriched, and the detail information of the low-level feature layer can be fully utilized.

And S300, reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map.

Specifically, in the embodiment, a plurality of multi-view images are reconstructed by a supervised three-dimensional reconstruction method, so as to obtain an initial depth map.

According to the embodiment, the initial depth map is obtained through a supervised three-dimensional reconstruction method, and the reconstruction precision can be improved. Because the supervised three-dimensional reconstruction method has high precision, but needs a large amount of training truth value data, and under certain specific scenes (for example, underwater), the training truth value is difficult to acquire and is difficult to apply. Therefore, step S400 is required to perform semantic guidance on the initial depth map of this embodiment, and the supervised three-dimensional reconstruction method is converted into an unsupervised one, so as to implement the unsupervised three-dimensional reconstruction, thereby overcoming the inherent defects of the supervised three-dimensional reconstruction method.

The supervised three-dimensional reconstruction method in this embodiment is any one of the supervised three-dimensional reconstruction methods in the prior art, for example, MVSNet (Depth index for unstrained multiple-View step), CVP-MVSNet (Cost Volume farm Based Depth index for multiple-View step), and PatchmatchNet (PatchmatchNet: left multiple-View patchmat step), and the detailed description thereof is omitted.

And S400, obtaining the depth maps of various scales based on the semantic segmentation sets and the initial depth map of various scales.

Specifically, in this embodiment, semantic information is used as a supervision signal to combine with a supervised three-dimensional reconstruction method, and the image reconstruction is guided to obtain a depth map, which specifically includes the following processes:

acquiring a plurality of multi-view images through image acquisition equipment, and taking the plurality of multi-view images as input to obtain an initial depth map through a supervised three-dimensional reconstruction method;

selecting a reference point from the reference image, acquiring a semantic category corresponding to the reference point in the semantic segmentation set, and acquiring a depth value corresponding to the reference point on the initial depth map;

the number of reference points is chosen by the following formula:

wherein,

representing the number of semantic categories contained in the jth semantic partition set,

based on each reference point, acquiring the matching point of each reference point on the graph to be matched through the following formula:

wherein,

the matching point of the ith reference point on the graph to be matched is shown, K represents the internal reference of the camera, T represents the external reference of the camera,

representing a reference point P in a reference picture _i Corresponding depth values on the initial depth map;

obtaining the semantic category corresponding to each matching point, correcting the multi-view image of each scale by minimizing a semantic loss function to obtain depth maps of various scales, and obtaining the semantic loss function

The calculation formula of (a) is as follows:

wherein,

representing the difference between the semantic information of the ith reference point and the semantic information of the ith matching point, M _i Representing the mask and N the number of reference points. This embodiment is illustrated by the following example:

firstly, a plurality of multi-view images of the same object under different view angles are obtained through an image acquisition device, the multi-view images are used as input, and an initial depth map can be obtained through a supervised three-dimensional reconstruction method. Selecting one of the input multi-view images as a reference image, and taking a point of reference point P on the reference image, wherein the rest images are images to be matched _i And its corresponding semantic class S on the segmented set S _i And a corresponding depth value on the depth map.

For the segmentation sets of different levels, the segmentation sets with more categories need to be guided more finely due to different semantic category numbers, the number of reference points needs to be more, and the number of the reference points is selected according to a formula:

the matching points corresponding to the reference points on the graph to be matched are obtained through the following homography matrix formula

：

Taking matching points

Semantic categories of

The semantic category of the matching point calculated by the reference point under the condition that the depth map is accurate (i.e. the depth value of the corresponding position is correct) should be the same as the semantic category of the reference point, and the following semantic loss function is calculated and minimized:

and continuously correcting the initial depth map by minimizing a semantic loss function, and finally obtaining an accurate depth map. The semantic information can replace a truth value for guiding, a supervised three-dimensional reconstruction method is converted into an unsupervised three-dimensional reconstruction method, and the self-supervised three-dimensional reconstruction is realized, so that the inherent defects of the supervised method are overcome.

In this embodiment, since the semantics of the image can be divided into three layers, a visual layer, an object layer and a concept layer, the semantics of the visual layer includes colors, lines, contours, etc., the semantics of the object layer includes various objects, and the semantics of the concept layer relates to understanding of the scene. In the prior art, part of three-dimensional reconstruction methods also use semantic information guidance, but high-level abstract semantic information (object layer) with a single scale has better precision on reconstruction tasks of large-scale objects, and on reconstruction tasks with small scales, the high-level abstract semantic information is relatively rough and the reconstruction precision is poor.

Therefore, in the embodiment, a plurality of multi-view images are used as input, and an initial depth map is obtained through a supervised three-dimensional reconstruction method; obtaining depth maps of various scales based on semantic segmentation sets and initial depth maps of various scales; in the embodiment, semantic guidance is respectively performed on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and an accurate depth map of multiple scales is obtained.

And S500, constructing a point cloud set with various scales based on the depth maps with various scales.

Specifically, a point cloud set of each scale is constructed by the following expression for the depth map of each scale:

wherein,

the abscissa representing the depth map is shown,

the ordinate of the depth map is represented,

and

And S600, according to the scale of the point cloud set, optimizing the point cloud sets with various scales by adopting different radius filtering to obtain the optimized point cloud set.

Specifically, a point cloud set with multiple scales is obtained, and the point cloud in the point cloud set with each scale has a corresponding radius and a preset number of adjacent points;

calculating the corresponding radius of the point cloud in the point cloud set according to the scale of the point cloud set by adopting the following formula:

wherein,

representing a constant parameter, t representing a constant parameter,

representing a preset scale grade of each point cloud set;

and optimizing the point cloud sets with various scales according to the radius size corresponding to each point cloud and the preset number of adjacent points to obtain the optimized point cloud sets.

In this embodiment, for point cloud sets of different scales, radius filtering is required after depth map conversion, noise points are filtered, and point cloud data are optimized. For point cloud sets with different scales, different radius filtering is adopted due to different aggregation degrees of the point clouds. Radius filtering, namely, firstly, acquiring the radius corresponding to each point cloud and presetting the quantity of adjacent points, only the point cloud which meets the requirement of having enough quantity of adjacent points in the radius range can be reserved, and the rest points are filtered out. For the multi-scale point cloud set of this embodiment, the semantic type of the point cloud in the segmentation set needs to be considered, that is, the point cloud having n number of neighboring points with the same semantic type in the radius is retained.

And S700, reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales.

Specifically, in step S600, point cloud sets of different scales are optimized to obtain point cloud sets optimized in different scales, and the point cloud sets optimized in each scale are reconstructed to obtain three-dimensional reconstruction results in different scales.

And step S800, splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.

Specifically, the three-dimensional reconstruction results of each scale are spliced and fused to obtain the final three-dimensional reconstruction result. In this embodiment, through the step S700, the reconstruction of different scales is performed based on the optimized point cloud set, and the optimized point cloud set is more accurate, so that the final three-dimensional reconstruction result obtained in this embodiment is also more accurate.

In the embodiment, a plurality of multi-view images are obtained, and multi-scale semantic feature extraction is performed on the plurality of multi-view images to obtain feature maps of various scales; performing multi-scale semantic segmentation on the feature maps of various scales to obtain semantic segmentation sets of various scales; in the embodiment, the deep-level features can be extracted by performing multi-scale semantic feature extraction on a plurality of multi-view images, and feature maps of various scales can be obtained. And multi-scale semantic segmentation is carried out on the feature maps of various scales, and semantic information of each scale is aggregated, so that the semantic information of each scale is enriched. In the embodiment, a plurality of multi-view images are used as input, and an initial depth map is obtained through a supervised three-dimensional reconstruction method; obtaining depth maps of various scales based on the semantic segmentation sets and the initial depth maps of various scales; in the embodiment, semantic guidance is respectively performed on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and an accurate depth map of multiple scales is obtained. The method comprises the steps of constructing a point cloud set with various scales based on depth maps with various scales; according to the scale of the point cloud set, optimizing the point cloud sets of various scales by adopting different radius filtering to obtain the optimized point cloud set; reconstructing at different scales based on the optimized point cloud set to obtain reconstruction results at different scales; and splicing and fusing the reconstruction results of each scale to obtain a final reconstruction result. In this embodiment, the obtained depth maps of multiple scales are used to construct a point cloud set of multiple scales, different radius filtering is adopted for optimization according to the scales of the point cloud set, the optimized point cloud set is used for reconstruction of different scales, and then the reconstruction results are fused to obtain a more accurate reconstruction result. According to the embodiment, semantic information of each scale can be fully utilized, and the accuracy of three-dimensional reconstruction can be improved.

Referring to fig. 5, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction system, which includes a feature map obtaining unit 100, a semantic segmentation set obtaining unit 200, an initial depth map obtaining unit 300, a depth map obtaining unit 400, a point cloud set obtaining unit 500, a radius filtering unit 600, a reconstruction result obtaining unit 700, and a reconstruction result fusion unit 800, where:

the feature map acquiring unit 100 is configured to acquire a multi-view image, perform multi-scale semantic feature extraction on the multi-view image, and acquire feature maps of multiple scales;

a semantic division set obtaining unit 200, configured to perform multi-scale semantic division on feature maps of multiple scales to obtain a semantic division set of multiple scales;

an initial depth map obtaining unit 300, configured to reconstruct the multiple multi-view images by using a supervised three-dimensional reconstruction method, so as to obtain an initial depth map;

a depth map obtaining unit 400, configured to obtain depth maps of multiple scales based on the multiple-scale semantic segmentation sets and the initial depth map;

a point cloud set obtaining unit 500, configured to construct a point cloud set of multiple scales based on depth maps of multiple scales;

the radius filtering unit 600 is configured to optimize point cloud sets of multiple scales by using different radius filtering according to the scale of the point cloud set, so as to obtain an optimized point cloud set;

a reconstruction result obtaining unit 700, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;

and a reconstruction result fusion unit 800, configured to splice and fuse the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.

It should be noted that, since the multi-view three-dimensional reconstruction system based on deep learning in the present embodiment is based on the same inventive concept as the above-mentioned multi-view three-dimensional reconstruction method based on deep learning, the corresponding contents in the method embodiments are also applicable to the present system embodiment, and are not described in detail herein.

The embodiment of the invention also provides a multi-view three-dimensional reconstruction device based on deep learning, which comprises: at least one control processor and a memory for communicative connection with the at least one control processor.

The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer-executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The non-transitory software program and instructions required to implement a deep learning based multi-view three-dimensional reconstruction method of the above embodiments are stored in a memory, and when executed by a processor, perform the deep learning based multi-view three-dimensional reconstruction method of the above embodiments, for example, perform the above-described method steps S100 to S800 in fig. 1.

The above described system embodiments are merely illustrative, wherein the units described as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Embodiments of the present invention also provide a computer-readable storage medium, which stores computer-executable instructions, which, when executed by one or more control processors, may cause the one or more control processors to perform one of the above method embodiments based on deep learning, for example, perform the functions of the above method steps S100 to S800 in fig. 1.

Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a general hardware platform. Those skilled in the art will appreciate that all or part of the processes of the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.

The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims

1. A multi-view three-dimensional reconstruction method based on deep learning is characterized by comprising the following steps:

constructing a point cloud set with various scales based on the depth maps with various scales;

2. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the performing multi-scale semantic feature extraction on the multiple multi-view images to obtain feature maps of multiple scales comprises:

3. The deep learning-based multi-view three-dimensional reconstruction method according to claim 2, wherein the obtaining of feature maps of multiple scales by weighting the importance of the original feature map of each scale through a channel attention mechanism comprises:

4. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales comprises:

the method comprises the following steps of mapping, connecting and remolding feature maps of various scales into a matrix V with HW rows and C columns, wherein the P represents a matrix with HW rows and K columns, the Q represents a matrix with K rows and C columns, the H represents a coefficient matrix, the W represents a base matrix, the K represents a non-negative matrix decomposition factor of a semantic cluster number, the C represents the dimension of each pixel, and the F represents the adoption of a non-inducible norm.

5. The method according to claim 1, wherein obtaining the depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map comprises:

the number of reference points is chosen by the following formula:

wherein,

representing the number of semantic categories contained in the jth said semantic partition set,

wherein,

obtaining semantic categories corresponding to each matching point, correcting the multi-view images of each scale by minimizing a semantic loss function to obtain the depth maps of various scales, wherein the semantic loss function

The calculation formula of (a) is as follows:

wherein,

6. The method for multi-view three-dimensional reconstruction based on deep learning according to claim 5, wherein the constructing a multi-scale point cloud set based on the multi-scale depth maps comprises:

constructing a point cloud set of each scale through the following expression for the depth map of each scale:

wherein,

the abscissa representing the depth map is shown as,

represents the ordinate of the depth map and,

and

7. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the optimization of the point cloud sets of multiple scales by using different radius filters according to the scales of the point cloud sets to obtain an optimized point cloud set comprises:

wherein,

representing a constant parameter, t representing a constant parameter,

representing a preset scale grade of each point cloud set;

and optimizing the point cloud sets with various scales according to the radius corresponding to each point cloud and the preset number of adjacent points to obtain the optimized point cloud set.

8. A deep learning based multi-view three-dimensional reconstruction system, characterized in that the deep learning based multi-view three-dimensional reconstruction system comprises:

the characteristic diagram acquisition unit is used for acquiring multi-view images and extracting multi-scale semantic characteristics of the multi-view images to acquire characteristic diagrams of multiple scales;

9. A deep learning based multi-view three-dimensional reconstruction device comprising at least one control processor and a memory for communicative connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the method of deep learning based multi-view three-dimensional reconstruction as claimed in any one of claims 1 to 7.

10. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of deep learning based multi-view three-dimensional reconstruction according to any one of claims 1 to 7.