
CN113592913A - Method for eliminating uncertainty of self-supervision three-dimensional reconstruction - Google Patents

Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Info

Publication number
CN113592913A
CN113592913A (application CN202110907900.4A, granted as CN113592913B)
Authority
CN
China
Prior art keywords
image
uncertainty
view
optical flow
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110907900.4A
Other languages
Chinese (zh)
Other versions
CN113592913B (en)
Inventor
许鸿斌 (Xu Hongbin)
周志鹏 (Zhou Zhipeng)
乔宇 (Qiao Yu)
康文雄 (Kang Wenxiong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110907900.4A
Publication of CN113592913A
Application granted
Publication of CN113592913B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for eliminating uncertainty in self-supervised three-dimensional reconstruction. The method comprises the following steps: pre-training a deep-learning three-dimensional reconstruction model with a set first loss function as the target, wherein the model takes view pairs composed of a reference view and a source view as input, and the first loss function is constructed from a photometric stereo consistency loss and a depth-optical flow consistency loss, the latter characterizing the pseudo optical flow information formed by pixels of the source view and their matching points in the reference view; and training the pre-trained model with a set second loss function as the optimization target, wherein the second loss function is constructed from an uncertainty mask estimated in the pre-training stage, the uncertainty mask characterizing the valid region of the input image. The invention requires no labeled data, overcomes the uncertainty problem in image reconstruction, and improves the accuracy and generalization ability of the model.

Description

Method for eliminating uncertainty of self-supervision three-dimensional reconstruction
Technical Field
The invention relates to the technical field of three-dimensional image reconstruction, and in particular to a method for eliminating uncertainty in self-supervised three-dimensional reconstruction.
Background
Multi-view Stereo (MVS) aims to recover the three-dimensional structure of a scene from multi-view images and camera poses. Traditional multi-view stereo methods have made great progress over the past decades, but their hand-crafted feature descriptors lack robustness when estimating the matching relationships of image pairs and are easily disturbed by factors such as noise or illumination.
In recent years, researchers have introduced deep learning into the MVS pipeline and achieved significant performance improvements with methods such as MVSNet and R-MVSNet. These methods integrate the image matching process into an end-to-end network that takes a series of multi-view images and camera parameters as input and directly outputs dense depth maps; the three-dimensional information of the whole scene is then recovered by fusing the depth maps of all views. In practical applications, however, these deep-learning-based MVS methods have a major drawback: they require large-scale datasets for training. Collecting three-dimensionally labeled data is expensive, which limits the wide application of such MVS methods. To remove the dependence on three-dimensional annotation, researchers have increasingly turned to unsupervised or self-supervised MVS methods. Existing self-supervised MVS methods mainly train the network through a proxy task based on image reconstruction: to satisfy the photometric stereo consistency assumption, an image of one view reconstructed from the predicted depth map and the images of the other views must be consistent with the original image.
However, in the prior art, self-supervised MVS methods still lack effective measures against uncertain factors such as color changes and object occlusion, which affects the quality of the reconstructed image.
Disclosure of Invention
The object of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a method for eliminating uncertainty in self-supervised three-dimensional reconstruction, comprising the following steps:
Step S1: pre-train a deep-learning three-dimensional reconstruction model with a set first loss function as the target. The model takes view pairs composed of a reference view and a source view as input and extracts the corresponding depth map for three-dimensional image reconstruction. The first loss function is constructed from a photometric stereo consistency loss and a depth-optical flow consistency loss, where the photometric stereo consistency loss characterizes the difference between the reconstructed image and the reference image, and the depth-optical flow consistency loss characterizes the pseudo optical flow information formed by pixels of the source view and their matching points in the reference view.
Step S2: train the pre-trained deep-learning three-dimensional reconstruction model with a set second loss function as the optimization target to obtain an optimized three-dimensional reconstruction model. The second loss function is constructed from an uncertainty mask estimated in the pre-training stage, and the uncertainty mask characterizes the valid region of the input image.
Compared with the prior art, the invention has the following advantages: to address the uncertainty caused by foreground supervision ambiguity, cross-view optical flow-depth consistency constraints introduce additional matching information that strengthens the constraining effect of the self-supervision signal; to address the uncertainty caused by invalid background interference, the uncertainty mask estimated during the self-supervision process is combined with pseudo labels, which effectively filters out regions that may introduce erroneous supervision signals and improves the quality of the reconstructed image.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a diagram illustrating the difference and uncertainty between fully supervised training and self supervised training in the prior art MVS;
FIG. 2 is a schematic diagram of a visual comparison of uncertainty of fully supervised and unsupervised signals in an MVS according to one embodiment of the present invention;
FIG. 3 is a flow diagram of a method for eliminating uncertainty in self-supervised three-dimensional reconstruction according to one embodiment of the invention;
FIG. 4 is a process diagram of a method for eliminating uncertainty in an auto-supervised three-dimensional reconstruction according to one embodiment of the present invention;
FIG. 5 is a diagram illustrating the relative transformation relationship between depth information and cross-view optical flow, according to an embodiment of the present invention;
FIG. 6 is a schematic illustration of a visual analysis of the optical flow signal guided self-supervised pre-training effect according to one embodiment of the present invention;
FIG. 7 is a schematic diagram of a visual analysis of the uncertainty-mask-guided self-supervised post-training effect in accordance with one embodiment of the present invention;
fig. 8 is a schematic diagram illustrating an application process of a three-dimensional reconstruction model according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
For a clear understanding of the present invention, the uncertainty problem in the existing self-supervised three-dimensional reconstruction process is first analyzed. Referring to fig. 1, fig. 1(a) is a schematic diagram of the fully supervised training process, fig. 1(b) is a schematic diagram of the self-supervised training process, and fig. 1(c) illustrates the degree of uncertainty in the supervision signals of fully supervised and self-supervised training. Briefly, self-supervised MVS methods replace the depth labels of supervised approaches by constructing a self-supervision signal through an image-reconstruction proxy task. The intuitive interpretation is that if the depth values estimated by the network are correct, then, under the homography projection relationship determined by those depth values, an image of one view reconstructed from the image of another view should coincide with the original image. Although the effectiveness of this self-supervision signal has been demonstrated, the prior art explains its utility only by intuition and lacks a direct, concrete explanation of, for example, where in the image the self-supervision signal is effective and where it is not. To answer these questions, the epistemic uncertainty in self-supervised training is visualized with the Monte-Carlo Dropout (MC Dropout) method to provide an intuitive interpretation. As shown in fig. 1(c), compared with fully supervised training, current self-supervised training exhibits more uncertainty in the background and boundary regions of the image.
To further analyze the causes of the uncertainty problem in self-supervised training, fig. 2 visually compares the uncertainty of the fully supervised and self-supervised (unsupervised) signals in MVS. On analysis, the uncertainty in existing self-supervised methods can be summarized into the following two types.
1) Uncertainty from foreground supervision ambiguity. As shown in fig. 2(a), under additional factors such as color variation and object occlusion (the circled regions), the proxy supervision signal based on image reconstruction cannot satisfy photometric stereo consistency in MVS, so the self-supervision signal does not carry the correct depth information.
2) Uncertainty from invalid background interference. As shown in fig. 2(b), the texture-less regions of the image (the circled regions) contain no effective matching cues and are usually discarded directly in fully supervised training. For self-supervised training, however, the whole image, including such invalid texture-less regions, enters the image-reconstruction proxy loss, which introduces extra noise interference and invalid supervision signals and leads to over-smoothed depth in the training result.
To address the uncertainty problem of self-supervised methods, the invention provides a method for eliminating uncertainty in self-supervised three-dimensional reconstruction which, with reference to figs. 3 and 4, comprises the following steps.
And step S310, constructing a deep learning three-dimensional reconstruction model.
The basic flow of self-supervised three-dimensional reconstruction with deep learning is as follows: the multi-view images are fed into a depth estimation network; the extracted feature maps are projected onto the same reference view through homography warping, and a matching cost volume across the views is constructed at a set of depth hypotheses, from which the depth map at the reference view is predicted; the depth maps of all views are fused to reconstruct the three-dimensional information of the whole scene; the self-supervision loss then estimates the difference between the reconstructed image and the original image, and the network is trained until convergence.
The deep-learning three-dimensional reconstruction model may employ various types of networks, such as MVSNet and R-MVSNet. Referring to fig. 4, in one embodiment the backbone network is MVSNet: the network takes the reference view image and N source view images as input, and the source view features are projected to the reference view according to the camera extrinsics, the whole process being differentiable. The variance of the feature maps across the views is computed to build a cost volume, and a 3D convolutional network then extracts further features. Several Monte-Carlo Dropout layers are embedded in the bottleneck of the 3D convolutional network; they are frozen by default and are activated only when the uncertainty mask needs to be estimated, as described below.
It should be understood that any network modified from MVSNet may replace the backbone, and other types of three-dimensional reconstruction models may also be used; the present invention does not limit the type or specific structure of the three-dimensional reconstruction model. Likewise, the number of source view images may be set as needed, for example 2 to 8.
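As an illustration of how dropout layers embedded in the bottleneck can stay frozen during ordinary training and be switched on only for uncertainty estimation, the following PyTorch-style sketch shows one possible arrangement; the module layout, channel sizes, and dropout rate are illustrative assumptions rather than the patented network.

```python
import torch.nn as nn

class Bottleneck3D(nn.Module):
    """Minimal 3D-convolution bottleneck with embedded Monte-Carlo Dropout layers."""

    def __init__(self, channels: int = 32, p: float = 0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Dropout3d(p),  # Monte-Carlo Dropout layer
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Dropout3d(p),  # Monte-Carlo Dropout layer
        )

    def forward(self, x):
        return self.block(x)

def set_mc_dropout(model: nn.Module, active: bool) -> None:
    """Freeze or activate only the dropout layers.

    Dropout is stochastic only in train mode, so activating it at inference
    time turns ordinary forward passes into Monte-Carlo samples.
    """
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train(active)

net = Bottleneck3D()
net.eval()
set_mc_dropout(net, active=False)  # default: dropout frozen, deterministic output
set_mc_dropout(net, active=True)   # only when estimating the uncertainty mask
```

Because only the dropout modules change mode, the rest of the network (including any normalization layers) remains in evaluation mode while sampling.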
Step S320: perform self-supervised pre-training of the deep-learning three-dimensional reconstruction model with a set loss function as the target, the loss function comprising a photometric stereo consistency loss and a depth-optical flow consistency loss.
Still referring to fig. 4, the self-supervised pre-training stage logically contains two branches. The upper branch takes view pairs composed of the reference view and a source view as input and obtains the corresponding depth map. The lower branch, on the one hand, estimates the forward optical flow (from the reference view image to the source view image) and the backward optical flow (from the source view image to the reference view image) for each pairwise view pair formed by the reference view and a source view; on the other hand, it derives virtual optical flow information, i.e., pseudo optical flow information, from the depth map predicted by the upper branch. The pre-trained model is evaluated by fusing the optical flow matching information from these two sources.
Specifically, in the self-supervised pre-training stage, to address the uncertainty caused by foreground supervision ambiguity, a depth-optical flow consistency loss is added in addition to the basic photometric stereo consistency loss; the extra dense matching priors provided by cross-view optical flow strengthen the robustness of the self-supervision signal.
1) Photometric stereo consistency loss

Let i = 1 denote the reference view and j (2 ≤ j ≤ V) denote a source view, where V is the total number of views. Given a pair of multi-view images (I_1, I_j) and their corresponding camera intrinsics and extrinsics ([K_1, T_1], [K_j, T_j]), the network outputs the depth map D_1 at the reference view. The pixel p̂_i^j in source view j corresponding to pixel p_i in the reference view can thus be computed by homography projection:

p̃_i^j = K_j · T_j · T_1^{-1} · K_1^{-1} · ( D_1(p_i) · p_i )    (1)

where i (1 ≤ i ≤ HW) denotes the position index of the pixel in the image, and H and W are the height and width of the image. The result of the homography formula (1) is normalized to obtain the coordinates in the corresponding image:

p̂_i^j = Norm(p̃_i^j),  with  Norm([x, y, z]^T) = [x/z, y/z, 1]^T    (2)

The image Î_j^1 is then reconstructed from source view j by means of a differentiable bilinear interpolation operation. Since only part of the pixels have a valid mapping during reconstruction, a binary mask M_j marking the valid region of the reconstructed image is obtained at the same time. In one embodiment, the photometric stereo consistency loss compares the difference between the reconstructed image and the reference image according to the following formula:

L_pc = ( ‖(Î_j^1 − I_1) ⊙ M_j‖_2 + ‖(ΔÎ_j^1 − ΔI_1) ⊙ M_j‖_2 ) / Σ_i M_j(p_i)    (3)

where Δ denotes the gradient of the image in the x and y directions, ⊙ denotes point-wise multiplication, I_1 is the reference image, and Î_j^1 is the image reconstructed based on the image of source view j.

Using the photometric stereo consistency loss for self-supervised pre-training ensures that the photometric difference (e.g., in gray values) between the reconstructed image and the reference image is as small as possible.
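To make formulas (1)-(3) concrete, the sketch below implements a standard differentiable homography warp followed by the masked photometric term, assuming PyTorch tensors with 3x3 intrinsics K and 4x4 world-to-camera extrinsics T; it illustrates the general technique rather than the exact patented implementation, and the image-gradient term of formula (3) is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def warp_src_to_ref(img_src, depth_ref, K_ref, K_src, T_ref, T_src):
    """Reconstruct the reference-view image from a source image using the
    reference depth map. img_src: (B,3,H,W), depth_ref: (B,1,H,W),
    K_*: (B,3,3), T_*: (B,4,4). Returns (warped image, validity mask)."""
    B, _, H, W = img_src.shape
    dev = img_src.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=dev, dtype=torch.float32),
        torch.arange(W, device=dev, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).view(1, 3, -1).expand(B, -1, -1)
    # Back-project reference pixels with their depth, then move to the source camera
    cam_ref = torch.linalg.inv(K_ref) @ pix * depth_ref.view(B, 1, -1)
    cam_ref_h = torch.cat([cam_ref, torch.ones(B, 1, H * W, device=dev)], dim=1)
    cam_src = (T_src @ torch.linalg.inv(T_ref) @ cam_ref_h)[:, :3]
    proj = K_src @ cam_src
    xy = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # Norm([x,y,z]) -> [x/z, y/z]
    # Differentiable bilinear interpolation via grid_sample
    grid = torch.stack([2 * xy[:, 0] / (W - 1) - 1,
                        2 * xy[:, 1] / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(img_src, grid, mode="bilinear", align_corners=True)
    valid = (grid.abs() <= 1).all(dim=-1).float().unsqueeze(1)  # binary mask M_j
    return warped, valid

def photometric_term(warped, ref, mask):
    """Masked photometric part of formula (3)."""
    return ((warped - ref) * mask).norm(dim=1).sum() / mask.sum().clamp(min=1.0)
```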
2) Cross-view optical flow-depth consistency loss
To address the foreground supervision ambiguity problem in self-supervised MVS, a new cross-view optical flow-depth consistency loss (the depth-optical flow consistency loss) is further proposed. Still referring to fig. 4, in one embodiment the computation of this loss involves two sub-modules: an image-to-optical-flow module and a depth-to-optical-flow module. The depth-to-optical-flow module is fully differentiable and converts the matching information contained in the depth map into an optical flow map between the reference view and any other view; it can be embedded in any network. The image-to-optical-flow module estimates optical flow directly from the original images with an unsupervised method, including the forward flows (e.g., I_1->I_2, I_1->I_3) and the backward flows (e.g., I_2->I_1, I_3->I_1). When computing the optical flow-depth consistency loss, the flow maps output by the two sub-modules are compared and required to be as similar as possible.
Specifically, the depth-to-optical-flow module is illustrated schematically in fig. 5. In an MVS system, images of different views are by default acquired by moving the camera, and depth information is recovered from the matching relationships of pixels across the views. Conversely, the same relative motion can be approximated by assuming a virtual camera that does not move while the object undergoes the relative motion instead. Intuitively, the matching information contained in the depth map then translates into the matching information of a pseudo optical flow map in such a relatively moving scene. The detailed derivation is as follows.

The virtual optical flow is defined as

F̃_1j(p_i) = p̂_i^j − p_i    (4)

where F̃_1j(p_i) denotes the flow formed by the pixel p̂_i^j of source view j and its matching point p_i at the reference view, and p̂_i^j is obtained from the homography projection formula given above. According to formula (4), the matching information in the depth map can be expressed in the form of an optical flow map, and the whole process is fully differentiable.
For the image-to-optical-flow module, the reference view is paired with each of the remaining source views of the current multi-view dataset to form pairwise view pairs. Using these view pairs, an optical flow learning network (e.g., PWC-Net) is pre-trained on the constructed dataset in an unsupervised (self-supervised) manner. The image-to-optical-flow module takes as input a pairwise view pair consisting of the reference view and a source view, and outputs the forward flow map F_1j (reference view -> source view) and the backward flow map F_j1 (source view -> reference view) between the two views.
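Reusing the conventions of the warp sketch above, the depth-to-optical-flow conversion of formula (4) amounts to projecting every reference pixel into the source view and subtracting the original pixel grid; the shapes and calling convention are illustrative assumptions.

```python
import torch

def depth_to_flow(depth_ref, K_ref, K_src, T_ref, T_src):
    """Virtual cross-view flow of formula (4): for every reference pixel p_i,
    return p̂_i^j - p_i, where p̂_i^j is the homography projection of p_i into
    the source view. Output shape: (B,2,H,W)."""
    B, _, H, W = depth_ref.shape
    dev = depth_ref.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=dev, dtype=torch.float32),
        torch.arange(W, device=dev, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).view(1, 3, -1).expand(B, -1, -1)
    cam_ref = torch.linalg.inv(K_ref) @ pix * depth_ref.view(B, 1, -1)
    cam_ref_h = torch.cat([cam_ref, torch.ones(B, 1, H * W, device=dev)], dim=1)
    cam_src = (T_src @ torch.linalg.inv(T_ref) @ cam_ref_h)[:, :3]
    proj = K_src @ cam_src
    p_hat = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # projected coordinates
    return (p_hat - pix[:, :2]).view(B, 2, H, W)        # virtual flow map
```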
3) Computation of the cross-view optical flow-depth consistency loss.
The depth-to-optical-flow module converts the predicted depth map D_1 into the virtual cross-view optical flow F̃_1j. The image-to-optical-flow module outputs the forward flow F_1j and the backward flow F_j1, and F_1j should be consistent with the matching information carried by the virtual flow F̃_1j.
First, for pixels that are not occluded, the forward flow F_1j and the backward flow F_j1 should be opposite in value. To avoid the interference of occluded regions when computing the loss, an occlusion mask O_1j is first computed from the forward flow F_1j and the backward flow F_j1:

O_1j = { |F_1j + F_j1| > ε }    (5)

where ε is a threshold that may be set according to the required accuracy, e.g., 0.5.
Next, the optical flow-depth consistency loss can be computed, expressed as:

L_fc = Σ_i ‖ F_1j(p_i) − F̃_1j(p_i) ‖_1 · (1 − O_1j(p_i)) / Σ_i (1 − O_1j(p_i))    (6)

where F_1j(p_i) is the optical flow value of pixel p_i in the forward flow map from the reference view to source view j, and F̃_1j(p_i) is the optical flow value of pixel p_i in the virtual flow map from the reference view to source view j.
Considering that the optical flow map itself is noisy, in this embodiment the loss is computed only for the view pair, among all pairs formed by the reference view and the source views, that yields the smallest error; this error minimization reduces the influence of noise in the optical flow map.
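A minimal sketch of formulas (5) and (6), assuming all three flow maps live on the reference-view pixel grid with shape (B,2,H,W); a fuller implementation might first warp the backward flow onto the reference grid before the occlusion test.

```python
def flow_depth_consistency(flow_fwd, flow_bwd, flow_virtual, eps=0.5):
    """Occlusion mask from forward/backward flow (formula (5)), then a masked
    L1 distance between the predicted flow and the depth-derived virtual flow
    (formula (6))."""
    occ = ((flow_fwd + flow_bwd).norm(dim=1, keepdim=True) > eps).float()  # O_1j
    valid = 1.0 - occ
    diff = (flow_fwd - flow_virtual).abs().sum(dim=1, keepdim=True)
    return (diff * valid).sum() / valid.sum().clamp(min=1.0)
```

In line with the error-minimization strategy described above, this value would be computed for every reference/source pair and only the smallest one kept.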
4) Calculation of total losses in the self-supervised pre-training phase
In the self-supervised pre-training stage, the two losses are fused into an overall loss so as to balance the photometric stereo consistency loss and the optical flow-depth consistency loss and thereby improve training accuracy and generalization ability:

L_ssp = L_pc + λ·L_fc    (7)

where L_pc is the photometric stereo consistency loss, L_fc is the optical flow-depth consistency loss, and λ is a constant set to balance the scales of the two losses, e.g., λ = 0.1.
Fig. 6 is a visual analysis of the effect of optical-flow-guided self-supervised pre-training, where the left side shows the result without optical flow guidance and the right side the result with it. It can be seen that computing the occlusion mask from the forward and backward optical flows and then using it in the depth-optical flow consistency loss makes the training aware of interference from occluded regions. The additional matching relations introduced by optical flow strengthen the constraint of the self-supervision signal and enlarge its effective area.
Step S330: further train the pre-trained model; in this stage, the uncertainty mask of the self-supervision process is estimated and combined with pseudo labels in order to filter out regions that would introduce erroneous supervision signals.
After the pre-trained model is obtained, it may be trained further, referred to herein as the pseudo-label post-training stage, in order to improve the accuracy and generalization ability of the model. In the pseudo-label post-training stage, the uncertainty of the self-supervision process is first estimated and then introduced into the self-training consistency loss to guide the training of the final model. In addition, multi-view images with random data augmentation may be employed for training.
Specifically, to deal with the problem of background noise interference, invalid regions such as texture-less regions are filtered out through the uncertainty mask in the pseudo-label post-training stage, since these invalid regions contain no information useful to the self-supervision signal. For example, the uncertainty mask of the self-supervision process is estimated by activating Monte-Carlo Dropout and is then incorporated into the loss so as to filter out the invalid regions.
1) Estimation of uncertainty
In practical applications, uncertainty describes the degree of doubt the model has about its own output. Monte-Carlo Dropout is added to the bottleneck layer of the model's 3D convolutional network, and, to avoid over-fitting, the loss of the self-supervised pre-training stage is preferably modified to introduce an uncertainty regularization term:

L'_ssp = L_ssp / σ² + log σ²    (8)

where σ² is the aleatoric uncertainty, representing the noise contained in the data itself.
In one embodiment, a 6-layer CNN (convolutional neural network) is used to predict a pixel-wise aleatoric uncertainty map. The loss function of the self-supervised pre-training stage is then modified according to formula (8) above so that it supports the training process of uncertainty estimation.
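Formula (8) corresponds to the common heteroscedastic weighting of a per-pixel loss; the sketch below assumes the 6-layer CNN predicts log σ² (a log-variance head is an implementation assumption that keeps the division numerically stable).

```python
import torch

def uncertainty_weighted_loss(loss_map, log_var):
    """Per-pixel loss divided by the aleatoric variance plus a log-variance
    regularizer, in the spirit of formula (8). Both inputs: (B,1,H,W)."""
    return (loss_map * torch.exp(-log_var) + log_var).mean()
```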
Stochastic Monte-Carlo Dropout is in effect equivalent to sampling different model weights W_t ~ q_θ(W, t), where q_θ(W, t) is the distribution induced by Dropout. Let W_t denote the model weights of the t-th sampling and D_{1,t} the corresponding predicted depth map. The (epistemic) uncertainty of the model can then be estimated as

U = (1/T) Σ_{t=1..T} (D_{1,t} − D̄_1)² + (1/T) Σ_{t=1..T} σ_t²    (9)

where D_{1,t} is the result of the t-th sampling, σ_t is the aleatoric uncertainty map corresponding to the t-th sample, and T is the number of samples. In one embodiment, the mean of the T samples is used as the pseudo label:

D̄_1 = (1/T) Σ_{t=1..T} D_{1,t}    (10)
compared with a Bayesian neural network, the Bayesian network is approximately simulated by embedding the Monte-Carlo Dropout layer in the embodiment of the invention, so that the calculation cost can be greatly reduced. In one practical application, the dropout rate may be set to 0.5, with the greater the number of samples closer to the ideal case, e.g., the default sampling of 20.
2) Uncertainty-aware self-training consistency loss
To alleviate the interference of regions with large uncertainty, the generated pseudo label and the uncertainty mask are used to construct an uncertainty-aware self-training consistency loss.
First, a binary mask M_U is computed from the learned uncertainty:

M_U = { U < ξ }    (11)

where ξ is a set threshold that may be chosen according to the accuracy requirement of the uncertainty estimate, e.g., ξ = 0.3.

Then the self-training consistency loss is computed, expressed as:

L_uc = Σ_i | D_{1,τ}(p_i) − D̄_1(p_i) | · M_U(p_i) / Σ_i M_U(p_i)    (12)

where D_{1,τ} is the depth map predicted by the network after random data augmentation. In the embodiment of the invention, the adopted data augmentation does not include positional transformations and only contains strategies such as random illumination change, color perturbation, and occlusion masks.
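Formulas (11) and (12) translate directly into a thresholded mask and a masked L1 term; a minimal sketch, assuming the uncertainty map, the pseudo label, and the augmented prediction share the same shape:

```python
def self_training_loss(depth_aug, pseudo_label, uncertainty, xi=0.3):
    """Uncertainty-aware self-training consistency loss: supervise the
    prediction on the augmented input with the pseudo label, but only where
    the estimated uncertainty is below the threshold xi."""
    mask = (uncertainty < xi).float()              # binary mask M_U, formula (11)
    diff = (depth_aug - pseudo_label).abs() * mask
    return diff.sum() / mask.sum().clamp(min=1.0)  # formula (12)
```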
It has been verified that estimating the uncertainty mask of the self-supervision process with Monte-Carlo Dropout and combining it with the pseudo label effectively suppresses the noisy supervision signals that may be present in the pseudo label. Fig. 7 is a visual analysis of uncertainty-mask-guided self-supervised post-training, with the effect without uncertainty guidance on the left and the effect with uncertainty guidance on the right. It can be seen that, compared with training directly on pseudo labels whose results are uncertain, the uncertainty-mask-guided self-supervised post-training of the invention effectively filters out regions that may introduce erroneous supervision signals.
In summary, the self-supervised MVS framework proposed by the invention is divided into a self-supervised pre-training stage and a pseudo-label post-training stage. The pre-training stage is driven by L_ssp; since the subsequent stage introduces Monte-Carlo Dropout and uncertainty estimation, L_ssp may also be modified into L'_ssp. In the pseudo-label post-training stage, the pseudo label and the uncertainty mask are first estimated by Monte-Carlo sampling of the pre-trained model, and the self-training loss L_uc is then computed to obtain the final model.
Compared with the prior art, the method and the device are better suited to natural scenes. Intuitively, the uncertainty estimated by the invention naturally covers the various kinds of noise, occlusion changes, and texture-less background regions of a natural scene, so the influence of these uncertain factors on the supervision process can be effectively suppressed during self-supervised training, ensuring a better training result. Experimentally, the deep-learning three-dimensional reconstruction model trained by the invention achieves leading results on the public natural-scene three-dimensional reconstruction dataset Tanks and Temples without any fine-tuning. The results are shown in Table 1 below, where the last row is the effect of the invention and the other rows are prior-art methods. The second column indicates whether real three-dimensional labels are used for model training; the third column is the overall score of the three-dimensional reconstruction effect in real scenes, provided by the dataset's online evaluation website, where larger is better; and the fourth to eleventh columns are the reconstruction scores in eight different real scenes.
Table 1: data set effect comparison
[Table 1 is reproduced as an image in the original publication; the numerical comparison is not shown here.]
In addition, compared with supervised methods that require labeled datasets, the invention adopts completely unsupervised training: the whole training process uses only the original multi-view images and camera parameters and requires no three-dimensional annotation. By combining optical flow matching information and uncertainty estimation to guide the training process, the resulting optimized model achieves reconstruction performance that is no weaker than, and in some scenes even stronger than, that of supervised methods. The final model obtained by the invention can be used for three-dimensional image reconstruction in a variety of scenarios, for example embedded in electronic devices, including but not limited to mobile phones, terminals, wearable devices, and computer equipment. Referring to fig. 8, the basic procedure on a terminal is as follows: the user opens an application on the terminal device and records and uploads a video; the terminal device cuts the video into frames and constructs multi-view image pairs; the camera poses are solved from the camera intrinsics and the multi-view image pairs (bundle adjustment); depth estimation is performed on the multi-view images with the trained deep-learning three-dimensional reconstruction model; the depth information of the multiple views is fused to obtain the three-dimensional information of the scene; and the terminal device displays the three-dimensional model to the user.
In summary, the present invention focuses on the uncertainty problem in self-supervised MVS, whose direct manifestations are the interference signals present in natural scenes, i.e., uncertainties in the foreground and the background such as occlusion, illumination variation, and texture-less background. Conventional training cannot handle these two kinds of uncertainty effectively, because regions carrying erroneous supervision signals are included directly in the training process, which inevitably affects the final result. Moreover, because the generalization ability is strengthened, the method can also be applied in cross-dataset scenarios.
It is to be understood that those skilled in the art may make appropriate modifications or changes to the above-described embodiments without departing from the spirit and scope of the invention. For example, besides the Monte-Carlo Dropout method, a Bayesian network could also be applied to estimate the uncertainty of the self-supervision process; in the actual training process, however, a Bayesian network is costly to train, difficult to embed into the network of the framework of the invention, and too large to fit on a common GPU (1080/2080Ti) for training. For this reason, the invention preferably approximates the sampling process of a Bayesian network by embedding Dropout layers, i.e., the Monte-Carlo Dropout approach, to reduce the model size and the computational cost of uncertainty estimation. As another example, the total loss of the pre-training stage may also be weighted in other ways, such as exponentially.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method for eliminating uncertainty of self-supervised three-dimensional reconstruction, comprising the following steps:
step S1: pre-training a deep-learning three-dimensional reconstruction model with a set first loss function as the target, wherein the deep-learning three-dimensional reconstruction model takes view pairs composed of a reference view image and a source view image as input and extracts the corresponding depth map for three-dimensional image reconstruction, the first loss function is constructed on the basis of a photometric stereo consistency loss and a depth-optical flow consistency loss, the photometric stereo consistency loss characterizes the difference between a reconstructed image and the reference image, and the depth-optical flow consistency loss characterizes the pseudo optical flow information formed by pixels of the source view and their matching points in the reference view;
step S2: training the pre-trained deep-learning three-dimensional reconstruction model with a set second loss function as the optimization target to obtain an optimized three-dimensional reconstruction model, wherein the second loss function is constructed from an uncertainty mask estimated in the pre-training stage, the uncertainty mask being used to characterize the valid region of an input image.
2. The method of claim 1, wherein the first loss function is set to:
L_ssp = L_pc + λ·L_fc
wherein L_pc denotes the photometric stereo consistency loss, L_fc denotes the depth-optical flow consistency loss, and λ is a set constant.
3. The method of claim 2, wherein the photometric stereo consistency loss is calculated according to the following steps:
computing the pixel p̂_i^j in source view j corresponding to pixel p_i in the reference view, expressed as:
p̃_i^j = K_j · T_j · T_1^{-1} · K_1^{-1} · ( D_1(p_i) · p_i )
normalizing the result to obtain the coordinates in the corresponding image:
p̂_i^j = Norm(p̃_i^j),  with  Norm([x, y, z]^T) = [x/z, y/z, 1]^T
obtaining the corresponding reconstructed image Î_j^1 based on the image of source view j through a differentiable bilinear interpolation operation, and obtaining, in the process of reconstructing the image, a binary mask M_j for representing the valid region of the reconstructed image;
calculating, according to the obtained binary mask M_j, the photometric stereo consistency loss as:
L_pc = ( ‖(Î_j^1 − I_1) ⊙ M_j‖_2 + ‖(ΔÎ_j^1 − ΔI_1) ⊙ M_j‖_2 ) / Σ_i M_j(p_i)
wherein Δ denotes the gradients of the image in the x and y directions, ⊙ denotes point-wise multiplication, i (1 ≤ i ≤ HW) denotes the position index of a pixel in the image, H and W denote the height and width of the image respectively, [K_1, T_1] and [K_j, T_j] are the camera intrinsic and extrinsic parameters corresponding to the pair of multi-view images (I_1, I_j), D_1 is the depth map predicted at the reference view, I_1 denotes the reference view image, and Î_j^1 denotes the image reconstructed based on the image of source view j.
4. The method of claim 2, wherein the depth-optical flow consistency loss is calculated according to the following steps:
pre-training an optical flow learning network with a dataset, the network taking as input a pairwise view pair composed of the reference view and a source view and outputting the forward flow map F_1j and the backward flow map F_j1 between the reference view and the source view;
converting the predicted depth map D_1 into the virtual cross-view optical flow F̃_1j;
calculating an occlusion mask O_1j from the forward flow F_1j and the backward flow F_j1, expressed as:
O_1j = { |F_1j + F_j1| > ε }
calculating the depth-optical flow consistency loss, expressed as:
L_fc = Σ_i ‖ F_1j(p_i) − F̃_1j(p_i) ‖_1 · (1 − O_1j(p_i)) / Σ_i (1 − O_1j(p_i))
wherein ε is a set threshold, i (1 ≤ i ≤ HW) denotes the position index of a pixel in the image, H and W denote the height and width of the image respectively, p_i denotes the pixel with position index i, D_1 is the depth map predicted at the reference view for the pair of multi-view images, F_1j(p_i) denotes the optical flow value of pixel p_i in the forward flow map from the reference view to source view j, and F̃_1j(p_i) denotes the optical flow value of pixel p_i in the virtual flow map from the reference view to source view j.
5. The method according to claim 1, characterized in that a Monte-Carlo Dropout layer is arranged on a bottleneck layer of the deep-learning three-dimensional reconstruction model for estimating the uncertainty of the pre-training process through multiple samplings.
6. The method of claim 5, wherein the second loss function is calculated according to the following steps:
estimating the uncertainty of the model by sampling different pre-trained model weights, expressed as:
U = (1/T) Σ_{t=1..T} (D_{1,t} − D̄_1)² + (1/T) Σ_{t=1..T} σ_t²
wherein D_{1,t} is the depth map predicted by the t-th sampling, D̄_1 is the mean of the sampled results, and T is the number of samples;
using the mean of the T samples as a pseudo label, expressed as:
D̄_1 = (1/T) Σ_{t=1..T} D_{1,t}
computing a binary mask M_U from the estimated uncertainty:
M_U = { U < ξ }
constructing the second loss function using the generated pseudo label and the uncertainty binary mask, expressed as:
L_uc = Σ_i | D_{1,τ}(p_i) − D̄_1(p_i) | · M_U(p_i) / Σ_i M_U(p_i)
wherein ξ denotes a set threshold, D_{1,τ} denotes the depth map predicted in step S2 after random data augmentation, and σ_t denotes the aleatoric uncertainty map corresponding to the t-th sample.
7. The method of claim 2, wherein the first loss function is modified to:
L'_ssp = L_ssp / σ² + log σ²
wherein σ² is the aleatoric uncertainty.
8. A method of reconstructing a three-dimensional image, comprising the steps of:
constructing multi-view image pairs from the captured images;
solving the camera poses according to the camera intrinsics and the multi-view images;
inputting the multi-view image pairs into an optimized three-dimensional reconstruction model obtained according to the method of any one of claims 1 to 7 to perform depth estimation on the multi-view images;
fusing the depth information of the multiple views to obtain the three-dimensional information of the scene and thereby an image three-dimensional model.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7 or 8.
10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 or 8 when executing the program.
CN202110907900.4A 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction Active CN113592913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110907900.4A CN113592913B (en) 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110907900.4A CN113592913B (en) 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Publications (2)

Publication Number Publication Date
CN113592913A true CN113592913A (en) 2021-11-02
CN113592913B CN113592913B (en) 2023-12-26

Family

ID=78256351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110907900.4A Active CN113592913B (en) 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Country Status (1)

Country Link
CN (1) CN113592913B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110246212A (en) * 2019-05-05 2019-09-17 上海工程技术大学 A kind of target three-dimensional rebuilding method based on self-supervisory study
US20210090279A1 (en) * 2019-09-20 2021-03-25 Google Llc Depth Determination for Images Captured with a Moving Camera and Representing Moving Features
CN112767468A (en) * 2021-02-05 2021-05-07 中国科学院深圳先进技术研究院 Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement
CN113066168A (en) * 2021-04-08 2021-07-02 云南大学 Multi-view stereo network three-dimensional reconstruction method and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022713A (en) * 2021-11-10 2022-02-08 重庆紫光华山智安科技有限公司 Model training method, system, device and medium
CN114782911A (en) * 2022-06-20 2022-07-22 小米汽车科技有限公司 Image processing method, device, equipment, medium, chip and vehicle
CN114820755A (en) * 2022-06-24 2022-07-29 武汉图科智能科技有限公司 Depth map estimation method and system
CN117218715A (en) * 2023-08-04 2023-12-12 广西壮族自治区通信产业服务有限公司技术服务分公司 Method, system, equipment and storage medium for identifying few-sample key nodes
CN116912148A (en) * 2023-09-12 2023-10-20 深圳思谋信息科技有限公司 Image enhancement method, device, computer equipment and computer readable storage medium
CN116912148B (en) * 2023-09-12 2024-01-05 深圳思谋信息科技有限公司 Image enhancement method, device, computer equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN113592913B (en) 2023-12-26

Similar Documents

Publication Publication Date Title
Bloesch et al. Codeslam—learning a compact, optimisable representation for dense visual slam
Tang et al. Learning guided convolutional network for depth completion
Bozic et al. Transformerfusion: Monocular rgb scene reconstruction using transformers
CN113592913B (en) Method for eliminating uncertainty of self-supervision three-dimensional reconstruction
US11232286B2 (en) Method and apparatus for generating face rotation image
Hu et al. Deep depth completion from extremely sparse data: A survey
CN105654492B (en) Robust real-time three-dimensional method for reconstructing based on consumer level camera
CN111723707B (en) Gaze point estimation method and device based on visual saliency
CN114339409B (en) Video processing method, device, computer equipment and storage medium
JP2023545190A (en) Image line-of-sight correction method, device, electronic device, and computer program
WO2023015414A1 (en) Method for eliminating uncertainty in self-supervised three-dimensional reconstruction
CN114429555A (en) Image density matching method, system, equipment and storage medium from coarse to fine
Gomes et al. Spatio-temporal graph-RNN for point cloud prediction
Leite et al. Exploiting motion perception in depth estimation through a lightweight convolutional neural network
JP2023522041A (en) A Reinforcement Learning Model to Label Spatial Relationships Between Images
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
Jung et al. Multi-task learning framework for motion estimation and dynamic scene deblurring
Tsuji et al. Non-guided depth completion with adversarial networks
CN117218713A (en) Action resolving method, device, equipment and storage medium
JP2024521816A (en) Unrestricted image stabilization
Gomes Graph-based network for dynamic point cloud prediction
KR20230083212A (en) Apparatus and method for estimating object posture
Xie et al. Effective convolutional neural network layers in flow estimation for omni-directional images
CN114841870A (en) Image processing method, related device and system
Sun et al. Unsupervised learning of optical flow in a multi-frame dynamic environment using temporal dynamic modeling

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant