
CN113592913A - Method for eliminating uncertainty of self-supervision three-dimensional reconstruction - Google Patents

Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Info

Publication number
CN113592913A
CN113592913A (application CN202110907900.4A, granted as CN113592913B)
Authority
CN
China
Prior art keywords
image
uncertainty
view
optical flow
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110907900.4A
Other languages
Chinese (zh)
Other versions
CN113592913B (en)
Inventor
许鸿斌 (Xu Hongbin)
周志鹏 (Zhou Zhipeng)
乔宇 (Qiao Yu)
康文雄 (Kang Wenxiong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110907900.4A
Publication of CN113592913A
Application granted
Publication of CN113592913B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for eliminating uncertainty in self-supervised three-dimensional reconstruction. The method comprises the following steps: pre-training a deep-learning three-dimensional reconstruction model with a set first loss function as the target, wherein the model takes view pairs composed of a reference view and a source view as input, and the first loss function is constructed from a photometric stereo consistency loss and a depth-optical flow consistency loss, the latter characterizing the pseudo optical flow information formed by pixels of the source view and their matching points in the reference view; and training the pre-trained model with a set second loss function as the optimization target, wherein the second loss function is constructed from an uncertainty mask estimated in the pre-training stage, the uncertainty mask characterizing the valid region of the input image. The invention requires no labeled data, overcomes the uncertainty problem in image reconstruction, and improves the accuracy and generalization ability of the model.

Description

Method for eliminating uncertainty of self-supervision three-dimensional reconstruction
Technical Field
The invention relates to the technical field of three-dimensional image reconstruction, and in particular to a method for eliminating uncertainty in self-supervised three-dimensional reconstruction.
Background
Multi-view Stereo (MVS) aims to recover the three-dimensional structure of a scene from multi-view images and camera poses. Traditional multi-view stereo methods have made great progress over the past decades, but their hand-crafted feature descriptors lack robustness when estimating the matching relationships of image pairs and are easily disturbed by factors such as noise or illumination.
In recent years, researchers have introduced deep learning into the MVS pipeline and achieved significant performance improvements with methods such as MVSNet and R-MVSNet. These methods integrate the image matching process into an end-to-end network that takes a series of multi-view images and camera parameters as input and directly outputs dense depth maps; the three-dimensional information of the whole scene is then recovered by fusing the depth maps of all views. In practical applications, however, these deep-learning-based MVS methods have a major drawback: they require large-scale datasets for training. Collecting three-dimensionally labeled data is expensive, which limits the wide application of such MVS methods. To remove the dependence on three-dimensional annotation, researchers have increasingly turned to unsupervised or self-supervised MVS methods. Existing self-supervised MVS methods mainly train the network through a proxy task based on image reconstruction: to satisfy the photometric stereo consistency assumption, an image of one view reconstructed from the predicted depth map and the images of the other views must be consistent with the original image.
However, in the prior art, self-supervised MVS methods still lack effective measures against uncertain factors such as color changes and object occlusion, which affects the quality of the reconstructed image.
Disclosure of Invention
The object of the present invention is to overcome the above-mentioned drawbacks of the prior art and to provide a method for eliminating uncertainty in self-supervised three-dimensional reconstruction, comprising the following steps:
Step S1: pre-train a deep-learning three-dimensional reconstruction model with a set first loss function as the target. The model takes view pairs composed of a reference view and a source view as input and extracts the corresponding depth map for three-dimensional image reconstruction. The first loss function is constructed from a photometric stereo consistency loss and a depth-optical flow consistency loss, where the photometric stereo consistency loss characterizes the difference between the reconstructed image and the reference image, and the depth-optical flow consistency loss characterizes the pseudo optical flow information formed by pixels of the source view and their matching points in the reference view.
Step S2: train the pre-trained deep-learning three-dimensional reconstruction model with a set second loss function as the optimization target to obtain an optimized three-dimensional reconstruction model. The second loss function is constructed from an uncertainty mask estimated in the pre-training stage, and the uncertainty mask characterizes the valid region of the input image.
Compared with the prior art, the invention has the following advantages: to address the uncertainty caused by foreground supervision ambiguity, cross-view optical flow-depth consistency constraints introduce additional matching information that strengthens the constraining effect of the self-supervision signal; to address the uncertainty caused by invalid background interference, the uncertainty mask estimated during the self-supervision process is combined with pseudo labels, which effectively filters out regions that may introduce erroneous supervision signals and improves the quality of the reconstructed image.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a diagram illustrating the difference and uncertainty between fully supervised training and self supervised training in the prior art MVS;
FIG. 2 is a schematic diagram of a visual comparison of uncertainty of fully supervised and unsupervised signals in an MVS according to one embodiment of the present invention;
FIG. 3 is a flow diagram of a method for eliminating uncertainty in self-supervised three-dimensional reconstruction according to one embodiment of the invention;
FIG. 4 is a process diagram of a method for eliminating uncertainty in an auto-supervised three-dimensional reconstruction according to one embodiment of the present invention;
FIG. 5 is a diagram illustrating the relative transformation relationship between depth information and cross-view optical flow, according to an embodiment of the present invention;
FIG. 6 is a schematic illustration of a visual analysis of the optical flow signal guided self-supervised pre-training effect according to one embodiment of the present invention;
FIG. 7 is a schematic diagram of a visual analysis of the uncertainty-mask-guided self-supervised post-training effect in accordance with one embodiment of the present invention;
fig. 8 is a schematic diagram illustrating an application process of a three-dimensional reconstruction model according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
For a clear understanding of the present invention, the uncertainty problem in the existing self-supervised three-dimensional reconstruction process is first analyzed. Referring to fig. 1, fig. 1(a) is a schematic diagram of the fully supervised training process, fig. 1(b) is a schematic diagram of the self-supervised training process, and fig. 1(c) illustrates the degree of uncertainty in the supervision signals of fully supervised and self-supervised training. Briefly, self-supervised MVS methods replace the depth labels of supervised approaches by constructing a self-supervision signal through an image-reconstruction proxy task. The intuitive interpretation is that if the depth values estimated by the network are correct, then, under the homography projection relationship determined by those depth values, an image of one view reconstructed from the image of another view should coincide with the original image. Although the effectiveness of this self-supervision signal has been demonstrated, the prior art explains its utility only by intuition and lacks a direct, concrete explanation of, for example, where in the image the self-supervision signal is effective and where it is not. To answer these questions, the epistemic uncertainty in self-supervised training is visualized with the Monte-Carlo Dropout (MC Dropout) method to provide an intuitive interpretation. As shown in fig. 1(c), compared with fully supervised training, current self-supervised training exhibits more uncertainty in the background and boundary regions of the image.
To further analyze the causes of the uncertainty problem in self-supervised training, fig. 2 visually compares the uncertainty of the fully supervised and self-supervised (unsupervised) signals in MVS. On analysis, the uncertainty in existing self-supervised methods can be summarized into the following two types.
1) Uncertainty from foreground supervision ambiguity. As shown in fig. 2(a), under additional factors such as color variation and object occlusion (the circled regions), the proxy supervision signal based on image reconstruction cannot satisfy photometric stereo consistency in MVS, so the self-supervision signal does not carry the correct depth information.
2) Uncertainty from invalid background interference. As shown in fig. 2(b), the texture-less regions of the image (the circled regions) contain no effective matching cues and are usually discarded directly in fully supervised training. For self-supervised training, however, the whole image, including such invalid texture-less regions, enters the image-reconstruction proxy loss, which introduces extra noise interference and invalid supervision signals and leads to over-smoothed depth in the training result.
To address the uncertainty problem of self-supervised methods, the invention provides a method for eliminating uncertainty in self-supervised three-dimensional reconstruction which, with reference to figs. 3 and 4, comprises the following steps.
And step S310, constructing a deep learning three-dimensional reconstruction model.
The basic flow of self-supervised three-dimensional reconstruction with deep learning is as follows: the multi-view images are fed into a depth estimation network; the extracted feature maps are projected onto the same reference view through homography warping, and a matching cost volume across the views is constructed at a set of depth hypotheses, from which the depth map at the reference view is predicted; the depth maps of all views are fused to reconstruct the three-dimensional information of the whole scene; the self-supervision loss then estimates the difference between the reconstructed image and the original image, and the network is trained until convergence.
The deep-learning three-dimensional reconstruction model may employ various types of networks, such as MVSNet and R-MVSNet. Referring to fig. 4, in one embodiment the backbone network is MVSNet: the network takes the reference view image and N source view images as input, and the source view features are projected to the reference view according to the camera extrinsics, the whole process being differentiable. The variance of the feature maps across the views is computed to build a cost volume, and a 3D convolutional network then extracts further features. Several Monte-Carlo Dropout layers are embedded in the bottleneck of the 3D convolutional network; they are frozen by default and are activated only when the uncertainty mask needs to be estimated, as described below.
It should be understood that any network modified from MVSNet may replace the backbone, and other types of three-dimensional reconstruction models may also be used; the present invention does not limit the type or specific structure of the three-dimensional reconstruction model. Likewise, the number of source view images may be set as needed, for example 2 to 8.
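As an illustration of how dropout layers embedded in the bottleneck can stay frozen during ordinary training and be switched on only for uncertainty estimation, the following PyTorch-style sketch shows one possible arrangement; the module layout, channel sizes, and dropout rate are illustrative assumptions rather than the patented network.

```python
import torch.nn as nn

class Bottleneck3D(nn.Module):
    """Minimal 3D-convolution bottleneck with embedded Monte-Carlo Dropout layers."""

    def __init__(self, channels: int = 32, p: float = 0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Dropout3d(p),  # Monte-Carlo Dropout layer
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Dropout3d(p),  # Monte-Carlo Dropout layer
        )

    def forward(self, x):
        return self.block(x)

def set_mc_dropout(model: nn.Module, active: bool) -> None:
    """Freeze or activate only the dropout layers.

    Dropout is stochastic only in train mode, so activating it at inference
    time turns ordinary forward passes into Monte-Carlo samples.
    """
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train(active)

net = Bottleneck3D()
net.eval()
set_mc_dropout(net, active=False)  # default: dropout frozen, deterministic output
set_mc_dropout(net, active=True)   # only when estimating the uncertainty mask
```

Because only the dropout modules change mode, the rest of the network (including any normalization layers) remains in evaluation mode while sampling.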
Step S320: perform self-supervised pre-training of the deep-learning three-dimensional reconstruction model with a set loss function as the target, the loss function comprising a photometric stereo consistency loss and a depth-optical flow consistency loss.
Still referring to fig. 4, the self-supervised pre-training stage logically contains two branches. The upper branch takes view pairs composed of the reference view and a source view as input and obtains the corresponding depth map. The lower branch, on the one hand, estimates the forward optical flow (from the reference view image to the source view image) and the backward optical flow (from the source view image to the reference view image) for each pairwise view pair formed by the reference view and a source view; on the other hand, it derives virtual optical flow information, i.e., pseudo optical flow information, from the depth map predicted by the upper branch. The pre-trained model is evaluated by fusing the optical flow matching information from these two sources.
Specifically, in the self-supervised pre-training stage, to address the uncertainty caused by foreground supervision ambiguity, a depth-optical flow consistency loss is added in addition to the basic photometric stereo consistency loss; the extra dense matching priors provided by cross-view optical flow strengthen the robustness of the self-supervision signal.
1) Photometric stereo consistency loss

Let i = 1 denote the reference view and j (2 ≤ j ≤ V) denote a source view, where V is the total number of views. Given a pair of multi-view images (I_1, I_j) and their corresponding camera intrinsics and extrinsics ([K_1, T_1], [K_j, T_j]), the network outputs the depth map D_1 at the reference view. The pixel p̂_i^j in source view j corresponding to pixel p_i in the reference view can thus be computed by homography projection:

p̃_i^j = K_j · T_j · T_1^{-1} · K_1^{-1} · ( D_1(p_i) · p_i )    (1)

where i (1 ≤ i ≤ HW) denotes the position index of the pixel in the image, and H and W are the height and width of the image. The result of the homography formula (1) is normalized to obtain the coordinates in the corresponding image:

p̂_i^j = Norm(p̃_i^j),  with  Norm([x, y, z]^T) = [x/z, y/z, 1]^T    (2)

The image Î_j^1 is then reconstructed from source view j by means of a differentiable bilinear interpolation operation. Since only part of the pixels have a valid mapping during reconstruction, a binary mask M_j marking the valid region of the reconstructed image is obtained at the same time. In one embodiment, the photometric stereo consistency loss compares the difference between the reconstructed image and the reference image according to the following formula:

L_pc = ( ‖(Î_j^1 − I_1) ⊙ M_j‖_2 + ‖(ΔÎ_j^1 − ΔI_1) ⊙ M_j‖_2 ) / Σ_i M_j(p_i)    (3)

where Δ denotes the gradient of the image in the x and y directions, ⊙ denotes point-wise multiplication, I_1 is the reference image, and Î_j^1 is the image reconstructed based on the image of source view j.

Using the photometric stereo consistency loss for self-supervised pre-training ensures that the photometric difference (e.g., in gray values) between the reconstructed image and the reference image is as small as possible.
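To make formulas (1)-(3) concrete, the sketch below implements a standard differentiable homography warp followed by the masked photometric term, assuming PyTorch tensors with 3x3 intrinsics K and 4x4 world-to-camera extrinsics T; it illustrates the general technique rather than the exact patented implementation, and the image-gradient term of formula (3) is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def warp_src_to_ref(img_src, depth_ref, K_ref, K_src, T_ref, T_src):
    """Reconstruct the reference-view image from a source image using the
    reference depth map. img_src: (B,3,H,W), depth_ref: (B,1,H,W),
    K_*: (B,3,3), T_*: (B,4,4). Returns (warped image, validity mask)."""
    B, _, H, W = img_src.shape
    dev = img_src.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=dev, dtype=torch.float32),
        torch.arange(W, device=dev, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).view(1, 3, -1).expand(B, -1, -1)
    # Back-project reference pixels with their depth, then move to the source camera
    cam_ref = torch.linalg.inv(K_ref) @ pix * depth_ref.view(B, 1, -1)
    cam_ref_h = torch.cat([cam_ref, torch.ones(B, 1, H * W, device=dev)], dim=1)
    cam_src = (T_src @ torch.linalg.inv(T_ref) @ cam_ref_h)[:, :3]
    proj = K_src @ cam_src
    xy = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # Norm([x,y,z]) -> [x/z, y/z]
    # Differentiable bilinear interpolation via grid_sample
    grid = torch.stack([2 * xy[:, 0] / (W - 1) - 1,
                        2 * xy[:, 1] / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(img_src, grid, mode="bilinear", align_corners=True)
    valid = (grid.abs() <= 1).all(dim=-1).float().unsqueeze(1)  # binary mask M_j
    return warped, valid

def photometric_term(warped, ref, mask):
    """Masked photometric part of formula (3)."""
    return ((warped - ref) * mask).norm(dim=1).sum() / mask.sum().clamp(min=1.0)
```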
2) Cross-view optical flow-depth consistency loss
To address the foreground supervision ambiguity problem in self-supervised MVS, a new cross-view optical flow-depth consistency loss (the depth-optical flow consistency loss) is further proposed. Still referring to fig. 4, in one embodiment the computation of this loss involves two sub-modules: an image-to-optical-flow module and a depth-to-optical-flow module. The depth-to-optical-flow module is fully differentiable and converts the matching information contained in the depth map into an optical flow map between the reference view and any other view; it can be embedded in any network. The image-to-optical-flow module estimates optical flow directly from the original images with an unsupervised method, including the forward flows (e.g., I_1->I_2, I_1->I_3) and the backward flows (e.g., I_2->I_1, I_3->I_1). When computing the optical flow-depth consistency loss, the flow maps output by the two sub-modules are compared and required to be as similar as possible.
Specifically, the depth-to-optical-flow module is illustrated schematically in fig. 5. In an MVS system, images of different views are by default acquired by moving the camera, and depth information is recovered from the matching relationships of pixels across the views. Conversely, the same relative motion can be approximated by assuming a virtual camera that does not move while the object undergoes the relative motion instead. Intuitively, the matching information contained in the depth map then translates into the matching information of a pseudo optical flow map in such a relatively moving scene. The detailed derivation is as follows.

The virtual optical flow is defined as

F̃_1j(p_i) = p̂_i^j − p_i    (4)

where F̃_1j(p_i) denotes the flow formed by the pixel p̂_i^j of source view j and its matching point p_i at the reference view, and p̂_i^j is obtained from the homography projection formula given above. According to formula (4), the matching information in the depth map can be expressed in the form of an optical flow map, and the whole process is fully differentiable.
For the image-to-optical-flow module, the reference view is paired with each of the remaining source views of the current multi-view dataset to form pairwise view pairs. Using these view pairs, an optical flow learning network (e.g., PWC-Net) is pre-trained on the constructed dataset in an unsupervised (self-supervised) manner. The image-to-optical-flow module takes as input a pairwise view pair consisting of the reference view and a source view, and outputs the forward flow map F_1j (reference view -> source view) and the backward flow map F_j1 (source view -> reference view) between the two views.
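Reusing the conventions of the warp sketch above, the depth-to-optical-flow conversion of formula (4) amounts to projecting every reference pixel into the source view and subtracting the original pixel grid; the shapes and calling convention are illustrative assumptions.

```python
import torch

def depth_to_flow(depth_ref, K_ref, K_src, T_ref, T_src):
    """Virtual cross-view flow of formula (4): for every reference pixel p_i,
    return p̂_i^j - p_i, where p̂_i^j is the homography projection of p_i into
    the source view. Output shape: (B,2,H,W)."""
    B, _, H, W = depth_ref.shape
    dev = depth_ref.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=dev, dtype=torch.float32),
        torch.arange(W, device=dev, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).view(1, 3, -1).expand(B, -1, -1)
    cam_ref = torch.linalg.inv(K_ref) @ pix * depth_ref.view(B, 1, -1)
    cam_ref_h = torch.cat([cam_ref, torch.ones(B, 1, H * W, device=dev)], dim=1)
    cam_src = (T_src @ torch.linalg.inv(T_ref) @ cam_ref_h)[:, :3]
    proj = K_src @ cam_src
    p_hat = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # projected coordinates
    return (p_hat - pix[:, :2]).view(B, 2, H, W)        # virtual flow map
```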
3) Computation of the cross-view optical flow-depth consistency loss.
The depth-to-optical-flow module converts the predicted depth map D_1 into the virtual cross-view optical flow F̃_1j. The image-to-optical-flow module outputs the forward flow F_1j and the backward flow F_j1, and F_1j should be consistent with the matching information carried by the virtual flow F̃_1j.
First, for pixels that are not occluded, the forward flow F_1j and the backward flow F_j1 should be opposite in value. To avoid the interference of occluded regions when computing the loss, an occlusion mask O_1j is first computed from the forward flow F_1j and the backward flow F_j1:

O_1j = { |F_1j + F_j1| > ε }    (5)

where ε is a threshold that may be set according to the required accuracy, e.g., 0.5.
Next, the optical flow-depth consistency loss can be computed, expressed as:

L_fc = Σ_i ‖ F_1j(p_i) − F̃_1j(p_i) ‖_1 · (1 − O_1j(p_i)) / Σ_i (1 − O_1j(p_i))    (6)

where F_1j(p_i) is the optical flow value of pixel p_i in the forward flow map from the reference view to source view j, and F̃_1j(p_i) is the optical flow value of pixel p_i in the virtual flow map from the reference view to source view j.
Considering that the optical flow map itself is noisy, in this embodiment the loss is computed only for the view pair, among all pairs formed by the reference view and the source views, that yields the smallest error; this error minimization reduces the influence of noise in the optical flow map.
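A minimal sketch of formulas (5) and (6), assuming all three flow maps live on the reference-view pixel grid with shape (B,2,H,W); a fuller implementation might first warp the backward flow onto the reference grid before the occlusion test.

```python
def flow_depth_consistency(flow_fwd, flow_bwd, flow_virtual, eps=0.5):
    """Occlusion mask from forward/backward flow (formula (5)), then a masked
    L1 distance between the predicted flow and the depth-derived virtual flow
    (formula (6))."""
    occ = ((flow_fwd + flow_bwd).norm(dim=1, keepdim=True) > eps).float()  # O_1j
    valid = 1.0 - occ
    diff = (flow_fwd - flow_virtual).abs().sum(dim=1, keepdim=True)
    return (diff * valid).sum() / valid.sum().clamp(min=1.0)
```

In line with the error-minimization strategy described above, this value would be computed for every reference/source pair and only the smallest one kept.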
4) Calculation of total losses in the self-supervised pre-training phase
In the self-supervised pre-training stage, the two losses are fused into an overall loss so as to balance the photometric stereo consistency loss and the optical flow-depth consistency loss and thereby improve training accuracy and generalization ability:

L_ssp = L_pc + λ·L_fc    (7)

where L_pc is the photometric stereo consistency loss, L_fc is the optical flow-depth consistency loss, and λ is a constant set to balance the scales of the two losses, e.g., λ = 0.1.
Fig. 6 is a visual analysis of the effect of optical-flow-guided self-supervised pre-training, where the left side shows the result without optical flow guidance and the right side the result with it. It can be seen that computing the occlusion mask from the forward and backward optical flows and then using it in the depth-optical flow consistency loss makes the training aware of interference from occluded regions. The additional matching relations introduced by optical flow strengthen the constraint of the self-supervision signal and enlarge its effective area.
Step S330: further train the pre-trained model; in this stage, the uncertainty mask of the self-supervision process is estimated and combined with pseudo labels in order to filter out regions that would introduce erroneous supervision signals.
After the pre-trained model is obtained, it may be trained further, referred to herein as the pseudo-label post-training stage, in order to improve the accuracy and generalization ability of the model. In the pseudo-label post-training stage, the uncertainty of the self-supervision process is first estimated and then introduced into the self-training consistency loss to guide the training of the final model. In addition, multi-view images with random data augmentation may be employed for training.
Specifically, to deal with the problem of background noise interference, invalid regions such as texture-less regions are filtered out through the uncertainty mask in the pseudo-label post-training stage, since these invalid regions contain no information useful to the self-supervision signal. For example, the uncertainty mask of the self-supervision process is estimated by activating Monte-Carlo Dropout and is then incorporated into the loss so as to filter out the invalid regions.
1) Estimation of uncertainty
In practical applications, uncertainty describes the degree of doubt the model has about its own output. Monte-Carlo Dropout is added to the bottleneck layer of the model's 3D convolutional network, and, to avoid over-fitting, the loss of the self-supervised pre-training stage is preferably modified to introduce an uncertainty regularization term:

L'_ssp = L_ssp / σ² + log σ²    (8)

where σ² is the aleatoric uncertainty, representing the noise contained in the data itself.
In one embodiment, a 6-layer CNN (convolutional neural network) is used to predict a pixel-wise aleatoric uncertainty map. The loss function of the self-supervised pre-training stage is then modified according to formula (8) above so that it supports the training process of uncertainty estimation.
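Formula (8) corresponds to the common heteroscedastic weighting of a per-pixel loss; the sketch below assumes the 6-layer CNN predicts log σ² (a log-variance head is an implementation assumption that keeps the division numerically stable).

```python
import torch

def uncertainty_weighted_loss(loss_map, log_var):
    """Per-pixel loss divided by the aleatoric variance plus a log-variance
    regularizer, in the spirit of formula (8). Both inputs: (B,1,H,W)."""
    return (loss_map * torch.exp(-log_var) + log_var).mean()
```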
Stochastic Monte-Carlo Dropout is in effect equivalent to sampling different model weights W_t ~ q_θ(W, t), where q_θ(W, t) is the distribution induced by Dropout. Let W_t denote the model weights of the t-th sampling and D_{1,t} the corresponding predicted depth map. The (epistemic) uncertainty of the model can then be estimated as

U = (1/T) Σ_{t=1..T} (D_{1,t} − D̄_1)² + (1/T) Σ_{t=1..T} σ_t²    (9)

where D_{1,t} is the result of the t-th sampling, σ_t is the aleatoric uncertainty map corresponding to the t-th sample, and T is the number of samples. In one embodiment, the mean of the T samples is used as the pseudo label:

D̄_1 = (1/T) Σ_{t=1..T} D_{1,t}    (10)
compared with a Bayesian neural network, the Bayesian network is approximately simulated by embedding the Monte-Carlo Dropout layer in the embodiment of the invention, so that the calculation cost can be greatly reduced. In one practical application, the dropout rate may be set to 0.5, with the greater the number of samples closer to the ideal case, e.g., the default sampling of 20.
2) Uncertainty-aware self-training consistency loss
To alleviate the interference of regions with large uncertainty, the generated pseudo label and the uncertainty mask are used to construct an uncertainty-aware self-training consistency loss.
First, a binary mask M_U is computed from the learned uncertainty:

M_U = { U < ξ }    (11)

where ξ is a set threshold that may be chosen according to the accuracy requirement of the uncertainty estimate, e.g., ξ = 0.3.

Then the self-training consistency loss is computed, expressed as:

L_uc = Σ_i | D_{1,τ}(p_i) − D̄_1(p_i) | · M_U(p_i) / Σ_i M_U(p_i)    (12)

where D_{1,τ} is the depth map predicted by the network after random data augmentation. In the embodiment of the invention, the adopted data augmentation does not include positional transformations and only contains strategies such as random illumination change, color perturbation, and occlusion masks.
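Formulas (11) and (12) translate directly into a thresholded mask and a masked L1 term; a minimal sketch, assuming the uncertainty map, the pseudo label, and the augmented prediction share the same shape:

```python
def self_training_loss(depth_aug, pseudo_label, uncertainty, xi=0.3):
    """Uncertainty-aware self-training consistency loss: supervise the
    prediction on the augmented input with the pseudo label, but only where
    the estimated uncertainty is below the threshold xi."""
    mask = (uncertainty < xi).float()              # binary mask M_U, formula (11)
    diff = (depth_aug - pseudo_label).abs() * mask
    return diff.sum() / mask.sum().clamp(min=1.0)  # formula (12)
```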
It has been verified that estimating the uncertainty mask of the self-supervision process with Monte-Carlo Dropout and combining it with the pseudo label effectively suppresses the noisy supervision signals that may be present in the pseudo label. Fig. 7 is a visual analysis of uncertainty-mask-guided self-supervised post-training, with the effect without uncertainty guidance on the left and the effect with uncertainty guidance on the right. It can be seen that, compared with training directly on pseudo labels whose results are uncertain, the uncertainty-mask-guided self-supervised post-training of the invention effectively filters out regions that may introduce erroneous supervision signals.
In summary, the self-supervised MVS framework proposed by the invention is divided into a self-supervised pre-training stage and a pseudo-label post-training stage. The pre-training stage is driven by L_ssp; since the subsequent stage introduces Monte-Carlo Dropout and uncertainty estimation, L_ssp may also be modified into L'_ssp. In the pseudo-label post-training stage, the pseudo label and the uncertainty mask are first estimated by Monte-Carlo sampling of the pre-trained model, and the self-training loss L_uc is then computed to obtain the final model.
Compared with the prior art, the method and the device are better suited to natural scenes. Intuitively, the uncertainty estimated by the invention naturally covers the various kinds of noise, occlusion changes, and texture-less background regions of a natural scene, so the influence of these uncertain factors on the supervision process can be effectively suppressed during self-supervised training, ensuring a better training result. Experimentally, the deep-learning three-dimensional reconstruction model trained by the invention achieves leading results on the public natural-scene three-dimensional reconstruction dataset Tanks and Temples without any fine-tuning. The results are shown in Table 1 below, where the last row is the effect of the invention and the other rows are prior-art methods. The second column indicates whether real three-dimensional labels are used for model training; the third column is the overall score of the three-dimensional reconstruction effect in real scenes, provided by the dataset's online evaluation website, where larger is better; and the fourth to eleventh columns are the reconstruction scores in eight different real scenes.
Table 1: data set effect comparison
[Table 1 is reproduced as an image in the original publication; the numerical comparison is not shown here.]
In addition, compared with supervised methods that require labeled datasets, the invention adopts completely unsupervised training: the whole training process uses only the original multi-view images and camera parameters and requires no three-dimensional annotation. By combining optical flow matching information and uncertainty estimation to guide the training process, the resulting optimized model achieves reconstruction performance that is no weaker than, and in some scenes even stronger than, that of supervised methods. The final model obtained by the invention can be used for three-dimensional image reconstruction in a variety of scenarios, for example embedded in electronic devices, including but not limited to mobile phones, terminals, wearable devices, and computer equipment. Referring to fig. 8, the basic procedure on a terminal is as follows: the user opens an application on the terminal device and records and uploads a video; the terminal device cuts the video into frames and constructs multi-view image pairs; the camera poses are solved from the camera intrinsics and the multi-view image pairs (bundle adjustment); depth estimation is performed on the multi-view images with the trained deep-learning three-dimensional reconstruction model; the depth information of the multiple views is fused to obtain the three-dimensional information of the scene; and the terminal device displays the three-dimensional model to the user.
In summary, the present invention focuses on the uncertainty problem in self-supervised MVS, whose direct manifestations are the interference signals present in natural scenes, i.e., uncertainties in the foreground and the background such as occlusion, illumination variation, and texture-less background. Conventional training cannot handle these two kinds of uncertainty effectively, because regions carrying erroneous supervision signals are included directly in the training process, which inevitably affects the final result. Moreover, because the generalization ability is strengthened, the method can also be applied in cross-dataset scenarios.
It is to be understood that those skilled in the art may make appropriate modifications or changes to the above-described embodiments without departing from the spirit and scope of the invention. For example, besides the Monte-Carlo Dropout method, a Bayesian network could also be applied to estimate the uncertainty of the self-supervision process; in the actual training process, however, a Bayesian network is costly to train, difficult to embed into the network of the framework of the invention, and too large to fit on a common GPU (1080/2080Ti) for training. For this reason, the invention preferably approximates the sampling process of a Bayesian network by embedding Dropout layers, i.e., the Monte-Carlo Dropout approach, to reduce the model size and the computational cost of uncertainty estimation. As another example, the total loss of the pre-training stage may also be weighted in other ways, such as exponentially.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method for eliminating uncertainty of self-supervised three-dimensional reconstruction, comprising the following steps:
step S1: pre-training a deep-learning three-dimensional reconstruction model with a set first loss function as the target, wherein the deep-learning three-dimensional reconstruction model takes view pairs composed of a reference view image and a source view image as input and extracts the corresponding depth map for three-dimensional image reconstruction, the first loss function is constructed on the basis of a photometric stereo consistency loss and a depth-optical flow consistency loss, the photometric stereo consistency loss characterizes the difference between a reconstructed image and the reference image, and the depth-optical flow consistency loss characterizes the pseudo optical flow information formed by pixels of the source view and their matching points in the reference view;
step S2: training the pre-trained deep-learning three-dimensional reconstruction model with a set second loss function as the optimization target to obtain an optimized three-dimensional reconstruction model, wherein the second loss function is constructed from an uncertainty mask estimated in the pre-training stage, the uncertainty mask being used to characterize the valid region of an input image.
2. The method of claim 1, wherein the first loss function is set to:
L_ssp = L_pc + λ·L_fc
wherein L_pc denotes the photometric stereo consistency loss, L_fc denotes the depth-optical flow consistency loss, and λ is a set constant.
3. The method of claim 2, wherein the photometric stereo consistency loss is calculated according to the following steps:
computing the pixel p̂_i^j in source view j corresponding to pixel p_i in the reference view, expressed as:
p̃_i^j = K_j · T_j · T_1^{-1} · K_1^{-1} · ( D_1(p_i) · p_i )
normalizing the result to obtain the coordinates in the corresponding image:
p̂_i^j = Norm(p̃_i^j),  with  Norm([x, y, z]^T) = [x/z, y/z, 1]^T
obtaining the corresponding reconstructed image Î_j^1 based on the image of source view j through a differentiable bilinear interpolation operation, and obtaining, in the process of reconstructing the image, a binary mask M_j for representing the valid region of the reconstructed image;
calculating, according to the obtained binary mask M_j, the photometric stereo consistency loss as:
L_pc = ( ‖(Î_j^1 − I_1) ⊙ M_j‖_2 + ‖(ΔÎ_j^1 − ΔI_1) ⊙ M_j‖_2 ) / Σ_i M_j(p_i)
wherein Δ denotes the gradients of the image in the x and y directions, ⊙ denotes point-wise multiplication, i (1 ≤ i ≤ HW) denotes the position index of a pixel in the image, H and W denote the height and width of the image respectively, [K_1, T_1] and [K_j, T_j] are the camera intrinsic and extrinsic parameters corresponding to the pair of multi-view images (I_1, I_j), D_1 is the depth map predicted at the reference view, I_1 denotes the reference view image, and Î_j^1 denotes the image reconstructed based on the image of source view j.
4. The method of claim 2, wherein the depth-optical flow consistency loss is calculated according to the following steps:
pre-training an optical flow learning network with a dataset, the network taking as input a pairwise view pair composed of the reference view and a source view and outputting the forward flow map F_1j and the backward flow map F_j1 between the reference view and the source view;
converting the predicted depth map D_1 into the virtual cross-view optical flow F̃_1j;
calculating an occlusion mask O_1j from the forward flow F_1j and the backward flow F_j1, expressed as:
O_1j = { |F_1j + F_j1| > ε }
calculating the depth-optical flow consistency loss, expressed as:
L_fc = Σ_i ‖ F_1j(p_i) − F̃_1j(p_i) ‖_1 · (1 − O_1j(p_i)) / Σ_i (1 − O_1j(p_i))
wherein ε is a set threshold, i (1 ≤ i ≤ HW) denotes the position index of a pixel in the image, H and W denote the height and width of the image respectively, p_i denotes the pixel with position index i, D_1 is the depth map predicted at the reference view for the pair of multi-view images, F_1j(p_i) denotes the optical flow value of pixel p_i in the forward flow map from the reference view to source view j, and F̃_1j(p_i) denotes the optical flow value of pixel p_i in the virtual flow map from the reference view to source view j.
5. The method according to claim 1, characterized in that a Monte-Carlo Dropout layer is arranged on a bottleneck layer of the deep-learning three-dimensional reconstruction model for estimating the uncertainty of the pre-training process through multiple samplings.
6. The method of claim 5, wherein the second loss function is calculated according to the following steps:
estimating the uncertainty of the model by sampling different pre-trained model weights, expressed as:
U = (1/T) Σ_{t=1..T} (D_{1,t} − D̄_1)² + (1/T) Σ_{t=1..T} σ_t²
wherein D_{1,t} is the depth map predicted by the t-th sampling, D̄_1 is the mean of the sampled results, and T is the number of samples;
using the mean of the T samples as a pseudo label, expressed as:
D̄_1 = (1/T) Σ_{t=1..T} D_{1,t}
computing a binary mask M_U from the estimated uncertainty:
M_U = { U < ξ }
constructing the second loss function using the generated pseudo label and the uncertainty binary mask, expressed as:
L_uc = Σ_i | D_{1,τ}(p_i) − D̄_1(p_i) | · M_U(p_i) / Σ_i M_U(p_i)
wherein ξ denotes a set threshold, D_{1,τ} denotes the depth map predicted in step S2 after random data augmentation, and σ_t denotes the aleatoric uncertainty map corresponding to the t-th sample.
7. The method of claim 2, wherein the first loss function is modified to:
L'_ssp = L_ssp / σ² + log σ²
wherein σ² is the aleatoric uncertainty.
8. A method of reconstructing a three-dimensional image, comprising the steps of:
constructing multi-view image pairs from the captured images;
solving the camera poses according to the camera intrinsics and the multi-view images;
inputting the multi-view image pairs into an optimized three-dimensional reconstruction model obtained according to the method of any one of claims 1 to 7 to perform depth estimation on the multi-view images;
fusing the depth information of the multiple views to obtain the three-dimensional information of the scene and thereby an image three-dimensional model.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7 or 8.
10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 or 8 when executing the program.
CN202110907900.4A 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction Active CN113592913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110907900.4A CN113592913B (en) 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110907900.4A CN113592913B (en) 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Publications (2)

Publication Number Publication Date
CN113592913A true CN113592913A (en) 2021-11-02
CN113592913B CN113592913B (en) 2023-12-26

Family

ID=78256351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110907900.4A Active CN113592913B (en) 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Country Status (1)

Country Link
CN (1) CN113592913B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110246212A (en) * 2019-05-05 2019-09-17 上海工程技术大学 A kind of target three-dimensional rebuilding method based on self-supervisory study
US20210090279A1 (en) * 2019-09-20 2021-03-25 Google Llc Depth Determination for Images Captured with a Moving Camera and Representing Moving Features
CN112767468A (en) * 2021-02-05 2021-05-07 中国科学院深圳先进技术研究院 Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement
CN113066168A (en) * 2021-04-08 2021-07-02 云南大学 Multi-view stereo network three-dimensional reconstruction method and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022713A (en) * 2021-11-10 2022-02-08 重庆紫光华山智安科技有限公司 Model training method, system, device and medium
CN114782911A (en) * 2022-06-20 2022-07-22 小米汽车科技有限公司 Image processing method, device, equipment, medium, chip and vehicle
CN114820755A (en) * 2022-06-24 2022-07-29 武汉图科智能科技有限公司 Depth map estimation method and system
CN117218715A (en) * 2023-08-04 2023-12-12 广西壮族自治区通信产业服务有限公司技术服务分公司 Method, system, equipment and storage medium for identifying few-sample key nodes
CN116912148A (en) * 2023-09-12 2023-10-20 深圳思谋信息科技有限公司 Image enhancement method, device, computer equipment and computer readable storage medium
CN116912148B (en) * 2023-09-12 2024-01-05 深圳思谋信息科技有限公司 Image enhancement method, device, computer equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN113592913B (en) 2023-12-26

Similar Documents

Publication Publication Date Title
Bloesch et al. Codeslam—learning a compact, optimisable representation for dense visual slam
Tang et al. Learning guided convolutional network for depth completion
Bozic et al. Transformerfusion: Monocular rgb scene reconstruction using transformers
CN113592913B (en) Method for eliminating uncertainty of self-supervision three-dimensional reconstruction
US11232286B2 (en) Method and apparatus for generating face rotation image
Hu et al. Deep depth completion from extremely sparse data: A survey
CN105654492B (en) Robust real-time three-dimensional method for reconstructing based on consumer level camera
CN111723707B (en) Gaze point estimation method and device based on visual saliency
CN114339409B (en) Video processing method, device, computer equipment and storage medium
JP2023545190A (en) Image line-of-sight correction method, device, electronic device, and computer program
WO2023015414A1 (en) Method for eliminating uncertainty in self-supervised three-dimensional reconstruction
CN114429555A (en) Image density matching method, system, equipment and storage medium from coarse to fine
Gomes et al. Spatio-temporal graph-RNN for point cloud prediction
Leite et al. Exploiting motion perception in depth estimation through a lightweight convolutional neural network
JP2023522041A (en) A Reinforcement Learning Model to Label Spatial Relationships Between Images
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
Jung et al. Multi-task learning framework for motion estimation and dynamic scene deblurring
Tsuji et al. Non-guided depth completion with adversarial networks
CN117218713A (en) Action resolving method, device, equipment and storage medium
JP2024521816A (en) Unrestricted image stabilization
Gomes Graph-based network for dynamic point cloud prediction
KR20230083212A (en) Apparatus and method for estimating object posture
Xie et al. Effective convolutional neural network layers in flow estimation for omni-directional images
CN114841870A (en) Image processing method, related device and system
Sun et al. Unsupervised learning of optical flow in a multi-frame dynamic environment using temporal dynamic modeling

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant