CN111899295A - Monocular scene depth prediction method based on deep learning - Google Patents
Monocular scene depth prediction method based on deep learning
- Publication number
- CN111899295A (application CN202010508803.3A)
- Authority
- CN
- China
- Prior art keywords
- image
- depth
- parallax
- binocular
- disparity map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Image Processing (AREA)
Abstract
A monocular scene depth prediction method based on deep learning, suitable for monocular pictures or video. A calibrated binocular color image pair is used to train the depth prediction model. In the network architecture, a DenseNet convolution module extracts the feature space; its dense blocks and transition layers connect each layer directly to the preceding layers so that features are reused. The binocular matching loss is improved: depth prediction is treated as an image reconstruction problem, the color image and disparity map of the input left viewpoint are sampled to generate a virtual color image and disparity map, and a stereo matching algorithm for binocular image pairs constrains the consistency between the generated virtual view and the corresponding input right-viewpoint image at both the RGB level and the disparity level, yielding better depth. The depth smoothing loss is also improved. The method generates high-quality dense depth maps, effectively alleviates the artifact problem caused by occlusion in monocular scene depth prediction, and meets the 2D-to-3D conversion needs of many indoor and outdoor real scenes.
Description
Technical Field
The invention relates to a monocular scene depth prediction method based on deep learning, and belongs to the field of computer vision and image processing.
Background
Monocular depth prediction is a research topic of great interest in computer vision, and has wide application value in the fields of automatic driving, VR game production, movie production and the like. However, there are still many problems to be solved in this field, such as:
1) collecting depth data with laser radar consumes a great deal of energy and is strongly affected by weather;
2) the generated depth image shows artifacts and smear caused by illumination shadows or occlusion in the original image;
3) methods that recover depth information from sparse depth maps suffer from discontinuous edge depth;
4) the depth model is not fully differentiable, so gradients cannot be computed during optimization and training becomes suboptimal;
5) the image generation model cannot scale to large output resolutions;
6) the generalization ability of the model is generally limited by the training data.
Disclosure of Invention
The invention provides a monocular scene depth prediction method based on deep learning to solve the above problems. The method is suitable for monocular pictures or videos, obtains a dense depth map of the scene image with an accuracy of up to 91%, and meets the depth prediction needs of many indoor and outdoor real scenes.
To achieve this purpose, the technical scheme of the invention is as follows: a monocular scene depth prediction method based on deep learning takes a calibrated color image pair as training input, uses a DenseNet convolution module to improve the network architecture of the encoder part, strengthens loss constraints at the levels of binocular stereo matching and image smoothing, alleviates the occlusion problem through post-processing, and generally improves the monocular depth prediction effect. The method comprises the following steps:
step 1: and preprocessing operation, namely resizing the high-resolution binocular color image pair to be 256x512, and performing random turning and contrast transformation on the image pair with unified size to perform a plurality of combined data enhancement transformations to increase the amount of input data, and then inputting the input data into an encoder of a convolutional network.
Step 2: the network encoder part extracts visual features using a DenseNet-based convolution module; dense connections improve the propagation of information and gradients through the network, alleviate the vanishing-gradient problem, and strengthen feature propagation;
Step 3: skip connections are set in the decoder part, splicing part of the feature maps from the encoding process directly into the decoding process; 64 convolution kernels of size 7x7 realize upsampling, and a sigmoid activation function generates the disparity;
Step 4: the binocular matching loss and the depth smoothing loss are strengthened, optimizing the model over the iterations, improving prediction accuracy, and smoothing the depth map while preserving its edges;
Step 5: the post-processing part is optimized. On one hand, because of stereo occlusion the left view sees more content on its left side and information on the right side of objects is easily lost; the input image is therefore flipped and a corresponding disparity map generated, which is combined with the disparity map of the original image, selecting the appropriate edge information to optimize the output disparity and alleviate occlusion at the image edges. On the other hand, based on object detection, the output disparity map is corrected with the original image, highlighting object edges in the scene and effectively eliminating smear.
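The following is a minimal sketch of the step-1 preprocessing, assuming PyTorch/torchvision; the flip probability and the contrast range are illustrative choices not fixed by this description, and note that a horizontal flip of a rectified stereo pair must also swap the left/right roles to keep the geometry valid:

```python
import random

import torchvision.transforms.functional as TF

def preprocess_pair(left, right, size=(256, 512), p_flip=0.5):
    """Resize a rectified stereo pair and apply paired augmentations.

    left, right: 3xHxW float tensors in [0, 1].
    """
    left = TF.resize(left, list(size))
    right = TF.resize(right, list(size))

    # Random horizontal flip: flip both views AND swap their roles,
    # otherwise the stereo geometry of the pair is broken.
    if random.random() < p_flip:
        left, right = TF.hflip(right), TF.hflip(left)

    # Shared random contrast jitter (the range is illustrative).
    c = random.uniform(0.8, 1.2)
    left = TF.adjust_contrast(left, c)
    right = TF.adjust_contrast(right, c)
    return left, right
```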
As an improvement of the present invention, step 2 is specifically as follows. The training data set of the depth prediction model is a calibrated binocular color image pair, resized to 256x512 as the input of the network. One convolution with 64 7x7 convolution kernels and one max pooling yield a tensor of 1/4 the input resolution (64x128) with 64 channels, which then enters four modules each consisting of a dense block and a transition layer;
the four dense blocks contain 2, 6, 12 and 24 layers respectively, progressively deepening the network. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 and the default bn_size is 4; inside each dense block, a 1x1 convolution is added in the bottleneck layer before the 3x3 convolution in order to reduce the number of network parameters. Between every two dense blocks, the transition layer integrates global information using average pooling, improving the compactness of the model. Dense connections improve the propagation of information and gradients through the network, deepen the network, alleviate the vanishing-gradient problem and strengthen feature propagation.
As an improvement of the present invention, step 4 is specifically as follows: to optimize the model and obtain a more accurate dense depth map, binocular matching loss and depth smoothing loss are added, strengthening the constraint of the loss function on the network:
(4.1) binocular matching loss: matching cost calculation is an important metric in stereo matching algorithms; using the correlation between the pixels of a binocular image pair, the similarity between the reconstructed image and the corresponding sampled view is compared, which strengthens stereo matching.
Under spatial similarity, strong correlation exists between the pixels of RGB images. Assume the original left image is $I^l_{ij}$ (i, j are the position coordinates of the pixel); a reconstructed left image $\tilde I^l_{ij}$ is obtained through a warping operation from the predicted disparity and the original right image. The warping looks up, for each pixel of the left image, the corresponding pixel in the right image according to its disparity value and then interpolates. A combination of an L1 term and a single-scale SSIM term is used as the photometric cost $C_{ap}$ of image reconstruction, computing the matching cost between the reconstructed view $\tilde I^l$ and the real input view $I^l$,
at the disparity level, the invention tries to make the left-viewpoint disparity map equal to the projected right-viewpoint disparity map. Specifically, taking the right image as the reference image, $d^r$ serves as the input image of the reconstruction operation, and, taking the left image as the reference image, $d^l$ serves as the input disparity map; the warping operation yields the reconstructed disparity map $\tilde d^l$ of $d^l$. The resulting loss $C_{lr}$ promotes consistency between the predicted left and right disparities,
(4.2) depth smoothing loss: where an edge exists in an image there is necessarily a large gradient value; conversely, in smooth parts of the image the gray values change little and the corresponding gradients are small. The gray values along the contour edges of the directly predicted disparity map (depth map) change markedly, giving a strong sense of layering, and depth discontinuities often occur at image gradients; when occlusion is involved, the necessary object boundaries must also be preserved;
since a dense disparity map is required, the disparity can be kept locally smooth by imposing an L1 penalty on the disparity gradients $\partial d$. However, this assumption penalizes edges incorrectly and does not apply to object boundaries, which typically correspond to regions of high variation in pixel intensity. Therefore an edge-aware term $e$ based on the image gradients $\partial I$ is introduced; this edge-aware term reduces the influence of the erroneous penalty,
on the basis of the edge-aware term e, a cross-based adaptive support-region method is added. The idea is to limit the support region of a pixel p and reduce the influence of erroneously penalizing edges. From point p, four arms extend upward, downward, left and right; points that satisfy the conditions are admitted into the support region $U_p$ of p, and the iteration continues until the conditions are no longer met (that is, the intensity difference must not be too large and the distance must not be too far).
In this scheme, in the binocular matching loss part, the matching loss between binocular image pairs is strengthened based on a reconstruction technique, and a disparity-level constraint is added on top of the RGB level. According to the original right-viewpoint RGB image and the predicted disparity map, warping yields the corresponding reconstructed left image; comparing the reconstructed left RGB image and disparity map with the original left image and the predicted disparity map gives the binocular matching loss.
In the depth smoothing loss part, the method smooths the depth map while preserving the depth changes at object edges and occlusions. A smoothing operation is applied to part of the disparity map; for object boundaries in regions of high pixel-intensity variation, the edge-aware term e is introduced to reduce the influence of erroneous penalties, and the cross-based adaptive support-region method limits the support region $U_p$ of a pixel p so that all pixels in $U_p$ differ from p neither too much in intensity nor too far in distance; the iteration terminates once these conditions are no longer met.
As an improvement of the present invention, step (5) specifically improves the generated disparity map as follows. To alleviate occlusion at the image edges, for the input picture I not only its disparity map $D_l$ is computed, but also the disparity map $D_l'$ of the mirror-flipped image I' of picture I; this disparity map is then flipped back to obtain $D_l''$, which is aligned with $D_l$. Combining the left 5% of $D_l''$, the right 5% of $D_l$, and the average of the two in between constitutes the final result, reducing the effect of stereo occlusion at the image edges. At the same time, removal of smear at object edges is added: based on object recognition, objects that may exist in the original scene are identified and aligned with the output disparity map; the pixels at the object edges in the disparity map are enhanced and the pixels of the smear region are set to the average of the neighborhood pixels excluding the edge, highlighting the objects in the scene and improving the quality of the disparity map.
Because a moving object may exist in the scene, the input image contains blurred parts. To highlight the objects in the scene, object recognition identifies the objects that may exist in the original scene and aligns them with the output disparity map; the pixels at object edges in the disparity map are enhanced and the pixels of the smear region are set to the average of the neighborhood pixels excluding the edge, eliminating the smear phenomenon to a certain extent and improving the quality of the disparity map.
The technical scheme is suitable for monocular pictures or videos. A calibrated binocular color image pair trains the depth prediction model; in the network architecture a DenseNet convolution module extracts the feature space, and its dense blocks and transition layers connect each layer directly to the preceding layers so that features are reused. The binocular matching loss is improved: depth prediction is treated as image reconstruction, the color image and disparity map of the input left viewpoint are sampled to generate a virtual color image and disparity map, and a stereo matching algorithm for binocular image pairs constrains the consistency between the generated virtual view and the corresponding input right-viewpoint image at both the RGB and disparity levels, yielding better depth. The depth smoothing loss is improved: an L1 penalty is imposed on the smoother or depth-discontinuous regions of the image, while for object boundaries and occlusions in regions of high pixel-intensity variation, a cross-based adaptive support-region method is added on top of the edge-aware term e, limiting the support region of a pixel p and reducing the influence of erroneously penalized edges. In post-processing optimization, based on object detection, the original image corrects the output disparity map, highlighting object edges in the scene and effectively eliminating smear; to produce a better effect on the right side of objects, the input image is flipped to generate a corresponding flipped disparity map, and the outer 5% of the original and flipped disparity maps are combined, taking the average of the two in the middle, to form the final result. Because the disparity predicted for moving objects may contain blur and smear, the invention identifies objects that may exist in the original scene based on object recognition, aligns them with the output disparity map, enhances the pixels at object edges in the disparity map and sets the pixels of the smear region to the average of the neighborhood pixels excluding the edge, eliminating the smear phenomenon to a certain extent and improving the quality of the disparity map. The method can generate high-quality dense depth maps, effectively alleviates artifacts caused by occlusion in monocular scene depth prediction, and meets the 2D-to-3D conversion needs of many indoor and outdoor real scenes.
Compared with the prior art, the invention has the following advantages: 1) the method improves the convolution module on the basis of training a scene depth prediction model with binocular image pairs: exploiting the dense linking of feature maps in the DenseNet module, each layer receives the feature input of all preceding layers and passes its own features to all subsequent layers, reducing information loss and consumption during transmission and improving depth prediction accuracy; 2) in the binocular matching loss part, a reconstruction technique adds a left-right disparity consistency constraint on top of the binocular RGB structural-similarity constraint, making full use of the advantage of binocular images as training data; 3) in the disparity smoothing loss part, a disparity smoothing operation on part of the disparity map yields a smooth depth map, while the introduced edge-aware term and adaptive support-region method preserve image and occlusion edge information, producing a clearer depth map; 4) the invention optimizes the edges of the output disparity map, obtaining more information at the right edge of the image by mirror-flipping the input image, effectively alleviating the occlusion problem; 5) the method effectively eliminates smear in the disparity map: by detecting objects in the original image and aligning them with the disparity map, the pixel values at object edges are enhanced and the disparity prediction accuracy of individual objects is improved.
Drawings
Figure 1 is an overall flow chart of the present invention,
figure 2 is a schematic diagram of the loss of binocular matching,
fig. 3 is a schematic diagram of finding a reasonable domain by adaptation.
Detailed Description
The invention is explained in detail below with reference to the drawings, and the specific steps are as follows.
Example 1: as shown in fig. 1, a monocular scene depth prediction method based on deep learning takes a calibrated color image pair as training input, uses a DenseNet convolution module to improve the network architecture of the encoder part, strengthens loss constraints at the levels of binocular stereo matching and image smoothing, alleviates the occlusion problem through post-processing, and generally improves the monocular depth prediction effect. The method comprises the following steps:
step 1: and preprocessing operation, namely resizing the high-resolution binocular color image pair to be 256x512, and performing random turning and contrast transformation on the image pair with unified size to perform a plurality of combined data enhancement transformations to increase the amount of input data, and then inputting the input data into an encoder of a convolutional network.
Step 2: the network encoder part extracts visual features by using a DenseNet-based convolution module, improves the transmission of information and gradient in a network through dense connection, relieves the problem of gradient disappearance and strengthens feature propagation;
and step 3: setting a skip action domain in a decoder part, directly splicing a part of feature maps in the encoding process into the decoding process, using 64 7x7 convolution kernels to realize up-sampling, and using a sigmoid function as an activation function to generate parallax;
and 4, step 4: the binocular matching loss and the depth smoothing loss are enhanced, the model is optimized in iteration, the prediction precision is improved, and the edge of the depth map is kept while the depth map is smoothed.
And 5: the post-processing part is optimized. On one hand, due to the existence of stereo occlusion, the left view can see more content on the left side, and the information of the right part of an object is easy to lose, so that the input image is turned over and a corresponding disparity map is generated, the disparity map is combined with the disparity map of the original image, proper edge information is selected, the output disparity is optimized, and the problem of occlusion of the image edge is solved; on the other hand, based on the object detection technology, the output disparity map is corrected by using the original image, so that the object edge in the scene is highlighted, and the smear is effectively eliminated.
Step 2 is specifically as follows. The training data set of the depth prediction model is a calibrated binocular color image pair, resized to 256x512 as the input of the network. One convolution with 64 7x7 convolution kernels and one max pooling yield a tensor of 1/4 the input resolution (64x128) with 64 channels, which then enters four modules each consisting of a dense block and a transition layer;
the four dense blocks contain 2, 6, 12 and 24 layers respectively, progressively deepening the network. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 and the default bn_size is 4; inside each dense block, a 1x1 convolution is added in the bottleneck layer before the 3x3 convolution in order to reduce the number of network parameters. Between every two dense blocks, the transition layer integrates global information using average pooling, improving the compactness of the model. Dense connections improve the propagation of information and gradients through the network, deepen the network, alleviate the vanishing-gradient problem and strengthen feature propagation. Step 4 is specifically as follows: to optimize the model and obtain a more accurate dense depth map, binocular matching loss and depth smoothing loss are added, strengthening the constraint of the loss function on the network:
(4.1) binocular matching loss: matching cost calculation is an important metric in stereo matching algorithms; using the correlation between the pixels of a binocular image pair, the similarity between the reconstructed image and the corresponding sampled view is compared, which strengthens stereo matching.
Under spatial similarity, strong correlation exists between the pixels of RGB images. Assume the original left image is $I^l_{ij}$ (i, j are the position coordinates of the pixel); a reconstructed left image $\tilde I^l_{ij}$ is obtained through a warping operation from the predicted disparity and the original right image. The warping looks up, for each pixel of the left image, the corresponding pixel in the right image according to its disparity value and then interpolates. A combination of an L1 term and a single-scale SSIM term is used as the photometric cost $C_{ap}$ of image reconstruction, computing the matching cost between the reconstructed view $\tilde I^l$ and the real input view $I^l$,
at the disparity level, the invention tries to make the left-viewpoint disparity map equal to the projected right-viewpoint disparity map. Specifically, taking the right image as the reference image, $d^r$ serves as the input image of the reconstruction operation, and, taking the left image as the reference image, $d^l$ serves as the input disparity map; the warping operation yields the reconstructed disparity map $\tilde d^l$ of $d^l$. The resulting loss $C_{lr}$ promotes consistency between the predicted left and right disparities,
(4.2) depth smoothing loss: where an edge exists in an image there is necessarily a large gradient value; conversely, in smooth parts of the image the gray values change little and the corresponding gradients are small. The gray values along the contour edges of the directly predicted disparity map (depth map) change markedly, giving a strong sense of layering, and depth discontinuities often occur at image gradients; when occlusion is involved, the necessary object boundaries must also be preserved;
since a dense disparity map is required, the disparity can be kept locally smooth by imposing an L1 penalty on the disparity gradients $\partial d$. However, this assumption penalizes edges incorrectly and does not apply to object boundaries, which typically correspond to regions of high variation in pixel intensity. Therefore an edge-aware term $e$ based on the image gradients $\partial I$ is introduced; this edge-aware term reduces the influence of the erroneous penalty,
on the basis of the edge-aware term e, a cross-based adaptive support-region method is added. The idea is to limit the support region of a pixel p and reduce the influence of erroneously penalizing edges. From point p, four arms extend upward, downward, left and right; points that satisfy the conditions are admitted into the support region $U_p$ of p, and the iteration continues until the conditions are no longer met (that is, the intensity difference must not be too large and the distance must not be too far).
In this scheme, in the binocular matching loss part, the matching loss between binocular image pairs is strengthened based on a reconstruction technique, and a disparity-level constraint is added on top of the RGB level. According to the original right-viewpoint RGB image and the predicted disparity map, warping yields the corresponding reconstructed left image; comparing the reconstructed left RGB image and disparity map with the original left image and the predicted disparity map gives the binocular matching loss.
In the depth smoothing loss part, the method smooths the depth map while preserving the depth changes at object edges and occlusions. A smoothing operation is applied to part of the disparity map; for object boundaries in regions of high pixel-intensity variation, the edge-aware term e is introduced to reduce the influence of erroneous penalties, and the cross-based adaptive support-region method limits the support region $U_p$ of a pixel p so that all pixels in $U_p$ differ from p neither too much in intensity nor too far in distance; the iteration terminates once these conditions are no longer met.
Step 5 is specifically as follows, improving the generated disparity map. To alleviate occlusion at the image edges, for the input picture I not only its disparity map $D_l$ is computed, but also the disparity map $D_l'$ of the mirror-flipped image I' of picture I; this disparity map is then flipped back to obtain $D_l''$, which is aligned with $D_l$. Combining the left 5% of $D_l''$, the right 5% of $D_l$, and the average of the two in between constitutes the final result, reducing the effect of stereo occlusion at the image edges. At the same time, removal of smear at object edges is added: based on object recognition, objects that may exist in the original scene are identified and aligned with the output disparity map; the pixels at the object edges in the disparity map are enhanced and the pixels of the smear region are set to the average of the neighborhood pixels excluding the edge, highlighting the objects in the scene and improving the quality of the disparity map.
Because a moving object may exist in the scene, the input image contains blurred parts. To highlight the objects in the scene, object recognition identifies the objects that may exist in the original scene and aligns them with the output disparity map; the pixels at object edges in the disparity map are enhanced and the pixels of the smear region are set to the average of the neighborhood pixels excluding the edge, eliminating the smear phenomenon to a certain extent and improving the quality of the disparity map.
The application example is as follows:
as shown in fig. 1, the present invention mainly comprises three parts: improvement of the encoder module of the disparity prediction network, enhancement of the loss-function algorithm, and optimization of the post-processing. Each part is described in detail below:
1. Depth prediction network structure based on the DenseNet module
The network takes resized 256x512 binocular RGB image pairs as input. In the encoder part of the network, one convolution (conv) with 64 7x7 convolution kernels and one max pooling (maxpool) reduce the image size to 1/4 and set the number of channels to 64.
The data then enters four modules each consisting of a dense block and a transition layer. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 and the default bn_size is 4; the bottleneck layer of each dense layer adds a Conv1x1 that outputs 128 channels (bn_size × growth_rate), and the Conv3x3 outputs 32 channels (growth_rate). The transition layer integrates global information using average pooling and sets the number of output channels to half the number of input channels.
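As a minimal sketch of these two building blocks, assuming PyTorch; only the figures quoted above (growth_rate 32, bn_size 4, Conv1x1 to 128 channels, Conv3x3 to 32 channels, channel halving plus average pooling in the transition layer) come from the description, the rest is illustrative:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """BN-ReLU-Conv1x1 bottleneck followed by BN-ReLU-Conv3x3."""
    def __init__(self, in_ch, growth_rate=32, bn_size=4):
        super().__init__()
        mid = bn_size * growth_rate  # 4 * 32 = 128 bottleneck channels
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Dense connectivity: concatenate the 32 new feature channels
        # onto everything produced so far.
        return torch.cat([x, self.body(x)], dim=1)

class Transition(nn.Module):
    """Halve the channel count, then average-pool to integrate global context."""
    def __init__(self, in_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch // 2, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x):
        return self.pool(self.conv(x))
```

The encoder stem (one convolution with 64 7x7 kernels plus max pooling) and the four blocks of 2, 6, 12 and 24 such dense layers would be assembled around these two modules.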
2. Algorithmic optimization of the binocular matching loss and the depth smoothing loss
(1) Binocular matching loss calculation is an important metric for stereo matching algorithms. Based on a reconstruction technique, image reconstruction is applied at the RGB and disparity levels, comparing the similarity between the reconstructed image and the corresponding sampled view and aggregating the computed costs:
A combination of an L1 term and a single-scale SSIM term is used as the photometric cost $C_{ap}$ of image reconstruction, encouraging the reconstructed image to be visually similar to the training input. Assume the original left image is $I^l_{ij}$ (i, j are the position coordinates of the pixel); a reconstructed left image $\tilde I^l_{ij}$ is obtained through a warping operation from the predicted disparity and the original right image. The warping looks up, for each pixel of the left image, the corresponding pixel in the right image according to its disparity value and interpolates. $C_{ap}$ is obtained by comparing the input image $I^l_{ij}$ with the reconstruction $\tilde I^l_{ij}$, using a simplified SSIM with a 3x3 block filter instead of a Gaussian filter, where N is the number of pixels and $\alpha$ weights the SSIM term against the L1 term:

$$C_{ap} = \frac{1}{N} \sum_{i,j} \alpha\,\frac{1 - \mathrm{SSIM}\!\left(I^l_{ij}, \tilde{I}^l_{ij}\right)}{2} + (1-\alpha)\left\| I^l_{ij} - \tilde{I}^l_{ij} \right\|_1$$
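A minimal sketch of this photometric cost, assuming PyTorch; the 3x3 block filter replaces the Gaussian window as described, while the SSIM constants and the weighting factor alpha are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def ssim_3x3(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Single-scale SSIM computed with a 3x3 block (mean) filter."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1), F.avg_pool2d(y, 3, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp(num / den, -1, 1)

def photometric_cost(I, I_rec, alpha=0.85):
    """C_ap: weighted combination of the (1 - SSIM)/2 term and the L1 term."""
    ssim_term = ((1 - ssim_3x3(I, I_rec)) / 2).mean()
    l1_term = (I - I_rec).abs().mean()
    return alpha * ssim_term + (1 - alpha) * l1_term
```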
To generate more accurate disparity maps, the network is trained to predict the disparities of both the left and the right image while using only the left view as input to the convolutional part of the network: it outputs both the disparity map $d^l$ referenced to the left image and the disparity map $d^r$ referenced to the right image. To ensure coherence, an L1 left-right disparity consistency loss is introduced as part of the model. This cost tries to make the left-view disparity map equal to the projected right-view disparity map. Specifically, $d^r$ (referenced to the right image) serves as the input image of the reconstruction operation and $d^l$ (referenced to the left image) serves as the input disparity map; the warping operation yields the reconstructed disparity map $\tilde d^l$ of $d^l$. Note that the result is a reconstructed disparity map, not a reconstructed left image, so the left-right disparity consistency loss $C_{lr}$, which promotes consistency between the predicted left and right disparities, can be written as:

$$C_{lr} = \frac{1}{N} \sum_{i,j} \left| d^l_{ij} - d^r_{ij + d^l_{ij}} \right|$$

where N is the number of pixels and $d^r_{ij+d^l_{ij}}$ is the reconstructed disparity $\tilde d^l_{ij}$. As with all other terms, this cost has a counterpart for the right-viewpoint disparity map and can be evaluated at all output scales.
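A sketch of the horizontal warping and the consistency cost, assuming PyTorch; the disparity sign convention (sampling at ij + d, as in the formula above) is tied to the rectification of the pair and may need to be negated for other conventions:

```python
import torch
import torch.nn.functional as F

def warp_horizontal(src, offset):
    """Bilinearly sample src along scanlines at x + offset.

    src:    B x C x H x W tensor (an image or a disparity map).
    offset: B x 1 x H x W horizontal offset in pixels (here: a disparity).
    """
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=src.device, dtype=src.dtype),
        torch.arange(w, device=src.device, dtype=src.dtype),
        indexing="ij")
    x_new = xs.unsqueeze(0) + offset.squeeze(1)        # (B, H, W)
    grid_x = 2 * x_new / (w - 1) - 1                   # normalize to [-1, 1]
    grid_y = (2 * ys / (h - 1) - 1).unsqueeze(0).expand_as(x_new)
    grid = torch.stack([grid_x, grid_y], dim=-1)       # (B, H, W, 2)
    return F.grid_sample(src, grid, mode="bilinear", align_corners=True)

def lr_consistency_cost(d_left, d_right):
    """C_lr: L1 between d^l and the right disparity warped by d^l itself."""
    d_left_rec = warp_horizontal(d_right, d_left)      # reconstructed d^l
    return (d_left - d_left_rec).abs().mean()
```

The same warping operator, applied to the right color image, yields the reconstructed left view used by the photometric cost above.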
(2) The gray values along the contour edges of the predicted disparity map change markedly, giving a strong sense of layering, and depth discontinuities often occur at image gradients. Since a dense disparity map is required here, not only must smoothness be kept locally, but information at object edges and occlusions must also be retained.
To keep the disparity locally smooth, an L1 penalty is imposed on the disparity gradients $\partial d$. However, this assumption penalizes edges erroneously and is not applicable to object boundaries, which typically correspond to regions of high variation in pixel intensity. Therefore an edge-aware term $e$ based on the image gradients $\partial I$ is introduced, reducing the influence of the erroneous penalty. The disparity smoothness loss $C_{ds}$ is defined as follows, using the disparity gradients $\partial_x d$, $\partial_y d$ and the image gradients $\partial_x I$, $\partial_y I$ in the x- and y-directions respectively:

$$C_{ds} = \frac{1}{N} \sum_{i,j} \left| \partial_x d_{ij} \right| e^{-\left\| \partial_x I_{ij} \right\|} + \left| \partial_y d_{ij} \right| e^{-\left\| \partial_y I_{ij} \right\|}$$
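A minimal sketch of this edge-aware smoothness cost, assuming PyTorch and forward differences for the gradients:

```python
import torch

def smoothness_cost(disp, img):
    """C_ds: L1 penalty on disparity gradients, down-weighted near image
    edges by the edge-aware term e^{-|dI|}."""
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    # Image gradients, averaged over the color channels.
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    # Weights become small where the image itself has strong edges.
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```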
On the basis of the edge-aware term e, a cross-based adaptive support-region method is added. The idea is to limit the support region of a pixel p and reduce the influence of erroneously penalizing edges. From point p, four arms extend upward, downward, left and right; points that satisfy the conditions are admitted into the support region $U_p$ of p, and the iteration continues until the conditions are no longer met (that is, the intensity difference must not be too large and the distance must not be too far).
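A sketch of the arm-growing step of this cross-based support region, assuming a grayscale intensity image; the intensity threshold tau and the maximum arm length are illustrative, since the description fixes neither:

```python
import numpy as np

def support_arms(gray, p, tau=20.0, max_arm=17):
    """Grow the four arms (up, down, left, right) of pixel p = (row, col).

    A neighbor is admitted while its intensity differs from gray[p] by at
    most tau (intensity must not differ too much) and its distance stays
    below max_arm (it must not be too far); returns the four arm lengths.
    """
    h, w = gray.shape
    r, c = p
    arms = []
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        n = 0
        while n < max_arm:
            rr, cc = r + (n + 1) * dr, c + (n + 1) * dc
            if not (0 <= rr < h and 0 <= cc < w):
                break
            if abs(float(gray[rr, cc]) - float(gray[r, c])) > tau:
                break
            n += 1
        arms.append(n)
    return tuple(arms)
```

The full support region $U_p$ is then assembled in the usual cross-based way, as the union of the horizontal arms of every pixel lying on the two vertical arms of p.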
3. In the post-processing part, occlusion at the image edges is alleviated and the smear phenomenon of moving objects is effectively eliminated, optimizing the output disparity map.
Because of stereo occlusion, the left view sees more content on its left side and information on the right side of objects is easily lost. Therefore, for the input picture I, not only its disparity map $D_l$ is computed but also the disparity map $D_l'$ of the mirror-flipped image I' of picture I; the appropriate edge information of the two disparity maps is selected and the output disparity optimized to reduce the effect of stereo occlusion at the image edges.
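A minimal sketch of this flip-based post-processing, assuming NumPy arrays and a hypothetical single-image predictor `predict`; the 5% band width follows the description above:

```python
import numpy as np

def postprocess_disparity(d, d_pp):
    """Blend D_l with the re-flipped mirror disparity D_l'': take D_l'' on
    the left 5% of columns, D_l on the right 5%, and their average between."""
    h, w = d.shape
    out = 0.5 * (d + d_pp)           # average in the middle
    k = max(1, int(round(0.05 * w)))
    out[:, :k] = d_pp[:, :k]         # left edge: the mirrored prediction
    out[:, w - k:] = d[:, w - k:]    # right edge: the original prediction
    return out

# Usage:
# d    = predict(img)               # D_l for the input picture I
# d_m  = predict(img[:, ::-1])      # D_l' for the mirror image I'
# d_pp = d_m[:, ::-1]               # D_l'', flipped back and aligned with D_l
# final = postprocess_disparity(d, d_pp)
```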
Because a moving object may exist in the scene, the input image contains blurred parts. To highlight the objects in the scene, object recognition identifies the objects that may exist in the original scene and aligns them with the output disparity map; the pixels at object edges in the disparity map are enhanced and the pixels of the smear region are set to the average of the neighborhood pixels excluding the edge, eliminating the smear phenomenon to a certain extent and improving the quality of the disparity map.
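As a sketch of the smear suppression, assuming a binary mask of smear pixels is already available (for example from the mismatch between detected object boxes and the disparity map, a step the description leaves open); each flagged pixel is replaced by the average of the non-flagged pixels in its neighborhood:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def suppress_smear(disp, smear_mask, k=7):
    """Replace smear pixels with the mean of the valid pixels in their
    k x k neighborhood (k is an illustrative window size)."""
    valid = (~smear_mask).astype(disp.dtype)
    num = uniform_filter(disp * valid, size=k)   # local sum of valid values / k^2
    den = uniform_filter(valid, size=k)          # local fraction of valid pixels
    fill = num / np.maximum(den, 1e-6)           # neighborhood average
    out = disp.copy()
    out[smear_mask] = fill[smear_mask]
    return out
```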
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and all equivalent substitutions or substitutions made on the above-mentioned technical solutions belong to the scope of the present invention.
Claims (4)
1. A monocular scene depth prediction method based on deep learning is characterized by comprising the following steps:
step 1: and preprocessing operation, namely resizing the high-resolution binocular color image pair to be 256x512, and performing random turning and contrast transformation on the image pair with unified size to perform a plurality of combined data enhancement transformations to increase the amount of input data, and then inputting the input data into an encoder of a convolutional network.
Step 2: the network encoder part extracts visual features by using a DenseNet-based convolution module, improves the transmission of information and gradient in a network through dense connection, relieves the problem of gradient disappearance and strengthens feature propagation;
and step 3: setting a skip action domain in a decoder part, directly splicing a part of feature maps in the encoding process into the decoding process, using 64 7x7 convolution kernels to realize up-sampling, and using a sigmoid function as an activation function to generate parallax;
and 4, step 4: the binocular matching loss and the depth smoothing loss are enhanced, the model is optimized in iteration, the prediction precision is improved, and the edge of the depth map is kept while the depth map is smoothed;
and 5: the post-processing part is optimized, the input image is inverted to generate corresponding parallax, the parallax of the original image is combined with the parallax of the original image to relieve the problem of edge shielding, the original image is aligned with the parallax image based on an object detection technology, pixels at the edge of the object are enhanced, and the phenomenon of smear can be eliminated to a certain extent.
2. The method of claim 1, wherein step 2 is specifically as follows: the training data set of the depth prediction model is a calibrated binocular color image pair, resized to 256x512 as the input of the network; one convolution with 64 7x7 convolution kernels and one max pooling yield a tensor of 1/4 the input resolution with 64 channels, which then enters four modules each consisting of a dense block and a transition layer;
the four dense blocks contain 2, 6, 12 and 24 layers respectively, progressively deepening the network; the growth rate (growth_rate) of the dense layers in all dense blocks is set to 32, the default bn_size is 4, and the bottleneck layer in each dense block adds a 1x1 convolution before the 3x3 convolution.
3. The method for predicting the depth of a monocular scene based on deep learning according to claim 1, wherein the step 4 is specifically as follows:
(3.1) binocular matching loss: matching cost calculation is an important metric in stereo matching algorithms, using the correlation between the pixels of the binocular image pair;
under spatial similarity, strong correlation exists between the pixels of RGB images. Assume the original left image is $I^l_{ij}$ (i, j are the position coordinates of the pixel); a reconstructed left image $\tilde I^l_{ij}$ is obtained through a warping operation from the predicted disparity and the original right image, where the warping looks up, for each pixel of the left image, the corresponding pixel in the right image according to its disparity value and interpolates; a combination of an L1 term and a single-scale SSIM term is used as the photometric cost $C_{ap}$ of image reconstruction, computing the matching cost between the reconstructed view $\tilde I^l$ and the real input view $I^l$,
at the disparity level, the left-viewpoint disparity map is made equal to the projected right-viewpoint disparity map: taking the right image as the reference image, $d^r$ serves as the input image of the reconstruction operation and, taking the left image as the reference image, $d^l$ serves as the input disparity map; the warping operation yields the reconstructed disparity map $\tilde d^l$ of $d^l$, and $C_{lr}$ promotes consistency between the predicted left and right disparities,
(3.2) depth smoothing loss: an edge-aware term $e$ based on the image gradients $\partial I$ is introduced, reducing the influence of the erroneous penalty,
on the basis of the edge-aware term e, a cross-based adaptive support-region method is added: from point p four arms extend upward, downward, left and right, and points that satisfy the conditions are admitted into the support region $U_p$ of p, iterating until the conditions are no longer met;
4. the method of claim 1, wherein the step (5) is specifically to improve the generated disparity map, and to reduce the occlusion of the image edge, not only the disparity map D of the input picture I is calculatedlAnd calculating a disparity map D for the mirror-inverted image I' of the picture Il' and then inverting the disparity map to obtain a disparity map Dl″,Dl"and DlAligning; binding of Dl"left 5% and DlThe right 5% of the image and the average between the two constitutes the final result to reduce the effect of stereo occlusion at the edges of the image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010508803.3A CN111899295B (en) | 2020-06-06 | 2020-06-06 | Monocular scene depth prediction method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010508803.3A CN111899295B (en) | 2020-06-06 | 2020-06-06 | Monocular scene depth prediction method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111899295A (en) | 2020-11-06
CN111899295B CN111899295B (en) | 2022-11-15 |
Family
ID=73208030
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010508803.3A Active CN111899295B (en) | 2020-06-06 | 2020-06-06 | Monocular scene depth prediction method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111899295B (en) |
- 2020-06-06: application CN202010508803.3A granted as patent CN111899295B (legal status: Active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110163246A (en) * | 2019-04-08 | 2019-08-23 | 杭州电子科技大学 | The unsupervised depth estimation method of monocular light field image based on convolutional neural networks |
CN110310317A (en) * | 2019-06-28 | 2019-10-08 | 西北工业大学 | A method of the monocular vision scene depth estimation based on deep learning |
CN110490919A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of depth estimation method of the monocular vision based on deep neural network |
Non-Patent Citations (2)
Title |
---|
Zhou Yuncheng et al.: "Depth estimation method for tomato plant images based on self-supervised learning", Transactions of the Chinese Society of Agricultural Engineering *
Huang Jun et al.: "Survey of advances in monocular depth estimation", Journal of Image and Graphics *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112561979A (en) * | 2020-12-25 | 2021-03-26 | 天津大学 | Self-supervision monocular depth estimation method based on deep learning |
CN113140011A (en) * | 2021-05-18 | 2021-07-20 | 烟台艾睿光电科技有限公司 | Infrared thermal imaging monocular vision distance measurement method and related assembly |
CN113140011B (en) * | 2021-05-18 | 2022-09-06 | 烟台艾睿光电科技有限公司 | Infrared thermal imaging monocular vision distance measurement method and related components |
CN114119698A (en) * | 2021-06-18 | 2022-03-01 | 湖南大学 | Unsupervised monocular depth estimation method based on attention mechanism |
CN114119698B (en) * | 2021-06-18 | 2022-07-19 | 湖南大学 | Unsupervised monocular depth estimation method based on attention mechanism |
CN114022529A (en) * | 2021-10-12 | 2022-02-08 | 东北大学 | Depth perception method and device based on self-adaptive binocular structured light |
CN114022529B (en) * | 2021-10-12 | 2024-04-16 | 东北大学 | Depth perception method and device based on self-adaptive binocular structured light |
CN115184016A (en) * | 2022-09-06 | 2022-10-14 | 江苏东控自动化科技有限公司 | Elevator bearing fault detection method |
Also Published As
Publication number | Publication date |
---|---|
CN111899295B (en) | 2022-11-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111899295B (en) | Monocular scene depth prediction method based on deep learning | |
CN110738697B (en) | Monocular depth estimation method based on deep learning | |
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
CN108648161B (en) | Binocular vision obstacle detection system and method of asymmetric kernel convolution neural network | |
Lee et al. | Local disparity estimation with three-moded cross census and advanced support weight | |
CN106408513B (en) | Depth map super resolution ratio reconstruction method | |
CN115205489A (en) | Three-dimensional reconstruction method, system and device in large scene | |
CN104954780A (en) | DIBR (depth image-based rendering) virtual image restoration method applicable to high-definition 2D/3D (two-dimensional/three-dimensional) conversion | |
CN115035171B (en) | Self-supervision monocular depth estimation method based on self-attention guide feature fusion | |
CN113610912B (en) | System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction | |
Liu et al. | High quality depth map estimation of object surface from light-field images | |
CN109741358B (en) | Superpixel segmentation method based on adaptive hypergraph learning | |
CN114996814A (en) | Furniture design system based on deep learning and three-dimensional reconstruction | |
CN114677479A (en) | Natural landscape multi-view three-dimensional reconstruction method based on deep learning | |
CN111369435B (en) | Color image depth up-sampling method and system based on self-adaptive stable model | |
CN113421210A (en) | Surface point cloud reconstruction method based on binocular stereo vision | |
CN115115860A (en) | Image feature point detection matching network based on deep learning | |
Zhu et al. | Hybrid scheme for accurate stereo matching | |
Pan et al. | An automatic 2D to 3D video conversion approach based on RGB-D images | |
CN115272450A (en) | Target positioning method based on panoramic segmentation | |
CN107194931A (en) | It is a kind of that the method and system for obtaining target depth information is matched based on binocular image | |
CN111985535A (en) | Method and device for optimizing human body depth map through neural network | |
Hou et al. | De‐NeRF: Ultra‐high‐definition NeRF with deformable net alignment | |
CN115937011B (en) | Key frame pose optimization visual SLAM method, storage medium and equipment based on time lag feature regression | |
CN116129036B (en) | Depth information guided omnidirectional image three-dimensional structure automatic recovery method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |