
CN111899295A - Monocular scene depth prediction method based on deep learning - Google Patents

Monocular scene depth prediction method based on deep learning

Info

Publication number
CN111899295A
Authority
CN
China
Prior art keywords
image
depth
parallax
binocular
disparity map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010508803.3A
Other languages
Chinese (zh)
Other versions
CN111899295B (en)
Inventor
姚莉
缪静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202010508803.3A
Publication of CN111899295A
Application granted
Publication of CN111899295B
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/50 - Depth or shape recovery
    • G06T 7/55 - Depth or shape recovery from multiple images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/13 - Edge detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

A monocular scene depth prediction method based on deep learning, suitable for monocular pictures or videos. A calibrated binocular color image pair is used to train the depth prediction model; in the network architecture, a DenseNet convolution module extracts the feature space, and its dense blocks and transition layers connect each layer in the network directly to the preceding layers so that features are reused. The binocular matching loss is improved: the depth prediction problem is treated as an image reconstruction problem, the color image and disparity map of the input left viewpoint are sampled to generate a virtual color image and disparity map, and a stereo matching algorithm for binocular image pairs constrains the consistency of the generated virtual view with the corresponding input right-viewpoint image at both the RGB level and the disparity level, yielding better depth. The depth smoothing loss is also improved. The method can generate high-quality dense depth maps, effectively alleviates the artifact problem caused by occlusion in monocular scene depth prediction, and can meet the 2D-to-3D conversion requirements of many indoor and outdoor real scenes.

Description

Monocular scene depth prediction method based on deep learning
Technical Field
The invention relates to a monocular scene depth prediction method based on deep learning, and belongs to the field of computer vision and image processing.
Background
Monocular depth prediction is a research topic of great interest in computer vision, with wide application value in fields such as autonomous driving, VR game production, and film production. However, many problems in this field remain unsolved, for example:
1) collecting depth data with laser radar consumes a lot of energy and is strongly affected by weather;
2) the generated depth image shows artifacts and smear caused by illumination shadows or occlusion in the original image;
3) methods that recover depth information from sparse depth maps suffer from discontinuous edge depth;
4) the depth model is not fully differentiable, so gradients become incomputable during optimization and training is suboptimal;
5) the image generation model cannot scale to large output resolutions;
6) the generalization ability of the model is generally limited by the training data.
Disclosure of Invention
To solve these problems, the invention provides a monocular scene depth prediction method based on deep learning. The method is suitable for monocular pictures or videos, obtains a dense depth map of the scene image with an accuracy of up to 91%, and can meet the depth prediction requirements of many indoor and outdoor real scenes.
In order to achieve this purpose, the technical scheme of the invention is as follows. A monocular scene depth prediction method based on deep learning takes a calibrated color image pair as training input, uses a DenseNet convolution module to improve the encoder part of the network architecture, strengthens loss constraints at the levels of binocular stereo matching and image smoothing, and alleviates the occlusion problem through post-processing, improving monocular depth prediction overall. The method comprises the following steps:

Step 1: preprocessing. The high-resolution binocular color image pairs are resized to 256x512, and the uniformly sized pairs undergo several combined data-augmentation transforms (random flipping and contrast changes) to increase the amount of input data before being fed to the encoder of the convolutional network (see the preprocessing sketch after these steps).

Step 2: the encoder part of the network extracts visual features with a DenseNet-based convolution module; dense connections improve the flow of information and gradients through the network, alleviate gradient vanishing, and strengthen feature propagation.

Step 3: skip connections are set in the decoder part, splicing some feature maps from the encoding stage directly into the decoding stage; upsampling is realized with 64 convolution kernels of size 7x7, and a sigmoid activation function generates the disparity.

Step 4: the binocular matching loss and the depth smoothing loss are strengthened, optimizing the model over the iterations, improving prediction accuracy, and smoothing the depth map while preserving its edges.

Step 5: the post-processing part is optimized. On one hand, because of stereo occlusion the left view sees more content on its left side and information on the right side of objects is easily lost; the input image is therefore flipped and a corresponding disparity map is generated, which is combined with the disparity map of the original image, appropriate edge information is selected, and the output disparity is optimized to alleviate occlusion at the image edges. On the other hand, based on object detection, the output disparity map is corrected with the original image, highlighting object edges in the scene and effectively removing smear.
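As a concrete illustration of step 1, below is a minimal preprocessing sketch in PyTorch; the interpolation mode, flip probability, and contrast range are assumptions, since the text only names the operations.

```python
import torch
import torch.nn.functional as F

def preprocess_pair(left, right, train=True):
    """Resize a binocular pair to 256x512 and apply paired augmentations.

    left, right: float tensors of shape (3, H, W) in [0, 1]. Flip and
    contrast parameters are illustrative; the patent only states that
    random flipping and contrast transforms are combined.
    """
    left = F.interpolate(left.unsqueeze(0), size=(256, 512),
                         mode="bilinear", align_corners=False).squeeze(0)
    right = F.interpolate(right.unsqueeze(0), size=(256, 512),
                          mode="bilinear", align_corners=False).squeeze(0)
    if train:
        if torch.rand(1).item() < 0.5:
            # Horizontal flip: the two views swap roles, so exchange them too.
            left, right = torch.flip(right, dims=[2]), torch.flip(left, dims=[2])
        if torch.rand(1).item() < 0.5:
            gain = torch.empty(1).uniform_(0.8, 1.2).item()  # assumed range
            left = (left * gain).clamp(0, 1)
            right = (right * gain).clamp(0, 1)
    return left, right
```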
As an improvement of the present invention, step 2 is as follows. The training data set of the depth prediction model consists of calibrated binocular color image pairs, resized to 256x512 as network input. One convolution with 64 kernels of size 7x7 followed by a max pooling yields a tensor of size 64x128x64 (one quarter of the input resolution, with 64 channels), which then enters four modules each consisting of a dense block and a transition layer.

The four dense blocks contain 2, 6, 12, and 24 layers respectively, progressively deepening the network. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 and the default bn_size is 4; in each dense block a 1x1 convolution is added before the 3x3 convolution at the bottleneck layer to reduce the number of network parameters. Between every two dense blocks, a transition layer integrates global information using average pooling and improves the compactness of the model. Dense connections improve the flow of information and gradients through the network, allow it to be deepened, alleviate gradient vanishing, and strengthen feature propagation.
As an improvement of the present invention, step 4 is as follows: to optimize the model and obtain a more accurate dense depth map, a binocular matching loss and a depth smoothing loss are added, strengthening the constraint of the loss function on the network:

(4.1) Binocular matching loss: matching cost computation is an important metric of stereo matching algorithms; comparing the similarity of the reconstructed image to the corresponding sampled view, using the correlation between the pixels of the binocular image pair, strengthens stereo matching.
Under spatial similarity, strong correlations exist between the pixels of RGB images. Let the original left image be $I^l_{ij}$, where $i, j$ are the pixel position coordinates. A reconstructed left image $\tilde{I}^l_{ij}$ is obtained by a warping operation from the predicted disparity and the original right image: for each pixel of the left image, the corresponding pixel in the right image is found from its disparity value and then interpolated. A combination of an L1 term and a single-scale SSIM term serves as the photometric cost $C_{ap}$ of image reconstruction, measuring the matching cost between the reconstructed view $\tilde{I}^l$ and the real input view $I^l$:

$$C_{ap} = \frac{1}{N} \sum_{i,j} \alpha \, \frac{1 - \mathrm{SSIM}\big(I^l_{ij}, \tilde{I}^l_{ij}\big)}{2} + (1 - \alpha) \, \big\lVert I^l_{ij} - \tilde{I}^l_{ij} \big\rVert_1$$
At the disparity level, the invention tries to make the left-viewpoint disparity map equal to the projected right-viewpoint disparity map. Concretely, $d^r$, which takes the right image as reference, serves as the input image of the reconstruction operation, and $d^l$, which takes the left image as reference, serves as the input disparity map; the warping operation then yields the reconstruction $\tilde{d}^l$ of $d^l$. The cost $C_{lr}$ encourages the predicted left and right disparities to be consistent:

$$C_{lr} = \frac{1}{N} \sum_{i,j} \big| d^l_{ij} - d^r_{ij + d^l_{ij}} \big|$$
(4.2) Depth smoothing loss: where an edge exists in an image there is necessarily a large gradient value; conversely, in smooth parts of the image the gray values change little and the corresponding gradient is small. The gray values at the contour edges of a directly predicted disparity (depth) map change sharply and the layering is strong, so depth discontinuities often appear at image gradients; when occlusion is involved, the necessary object boundaries must also be preserved;
Since a dense disparity map is required, the disparity should stay locally smooth, so an L1 penalty is applied to the disparity gradient $\partial d$. However, this assumption wrongly penalizes edges and does not hold at object boundaries, which typically correspond to regions of high pixel-intensity variation. An edge-aware term $e^{-\lVert \partial I \rVert}$ built from the image gradient $\partial I$ is therefore introduced to reduce the effect of the erroneous penalty:

$$C_{ds} = \frac{1}{N} \sum_{i,j} \big| \partial_x d_{ij} \big| \, e^{-\lVert \partial_x I_{ij} \rVert} + \big| \partial_y d_{ij} \big| \, e^{-\lVert \partial_y I_{ij} \rVert}$$
On the basis of the edge-aware term $e$, a cross-based adaptive support-region method is added. The idea is to limit the support region of a pixel $p$ and reduce the effect of wrongly penalizing edges. From $p$, four arms extend up, down, left, and right, and every point $q$ satisfying the conditions is admitted into the support region $U_p$ of $p$; the iteration continues until the conditions below fail (that is, the intensity difference must not be too large and the distance must not be too far):

$$U_p = \big\{\, q \;:\; \lvert I(q) - I(p) \rvert \le \tau, \;\; \lVert q - p \rVert_1 \le L \,\big\}$$

where $\tau$ is the intensity threshold and $L$ the maximum arm length.
In this scheme, the binocular matching loss strengthens the matching loss between binocular image pairs with a reconstruction technique, adding a disparity-level constraint on top of the RGB level: warping the original right-viewpoint RGB image with the predicted disparity map yields the corresponding reconstructed left image, and comparing the reconstructed left RGB image and disparity map against the original left image and predicted disparity map gives the binocular matching loss.

In the depth smoothing loss, the method smooths the depth map while preserving depth changes at object edges and occlusions. A smoothing operation is applied to the disparity map; for object boundaries in regions of high pixel-intensity variation, the edge-aware term $e$ reduces the effect of erroneous penalties, and the cross-based adaptive support-region method limits the support region $U_p$ of a pixel $p$ so that every pixel in $U_p$ differs from $p$ neither too much in intensity nor too far in distance, iterating until those conditions fail.
As an improvement of the present invention, step (5) improves the generated disparity map as follows. To alleviate occlusion at image edges, for an input picture $I$ not only its disparity map $D_l$ is computed, but also the disparity map $D_l'$ of the mirror-flipped image $I'$; flipping that disparity map back yields $D_l''$, aligned with $D_l$. Combining the left 5% of $D_l''$, the right 5% of $D_l$, and the average of the two in between forms the final result, reducing the effect of stereo occlusion at the image edges.

Because moving objects may exist in the scene, blurred regions appear in the input image. To highlight the objects in the scene, object recognition identifies the objects that may exist in the original image and aligns them with the output disparity map; pixels at object edges in the disparity map are enhanced and pixels in the smear region are set to the average of the neighboring non-edge pixels, eliminating smear to some extent and improving the quality of the disparity map.
The technical scheme is suitable for monocular pictures or videos. A calibrated binocular color image pair trains the depth prediction model; in the network architecture, a DenseNet convolution module extracts the feature space, and its dense blocks and transition layers connect each layer in the network directly to the preceding layers so that features are reused. The binocular matching loss is improved: the depth prediction problem is treated as an image reconstruction problem, the color image and disparity map of the input left viewpoint are sampled to generate a virtual color image and disparity map, and a stereo matching algorithm for binocular image pairs constrains the consistency of the generated virtual view with the corresponding input right-viewpoint image at both the RGB and disparity levels, yielding better depth. The depth smoothing loss is improved: an L1 penalty is imposed on smooth or depth-discontinuous regions of the image, while for object boundaries and occlusions in regions of high pixel-intensity variation, a cross-based adaptive support-region method is added on top of the edge-aware term $e$, limiting the support region of a pixel $p$ and reducing the effect of wrongly penalized edges. In post-processing optimization, based on object detection, the original image corrects the output disparity map, highlighting object edges in the scene and effectively removing smear. To produce a better result on the right side of objects, the input image is flipped to generate a corresponding flipped disparity map; the outer 5% bands of the original and flipped disparity maps are combined and the middle part takes the average of the two to form the final result. Because the disparity predicted for moving objects may show blur and smear, objects that may exist in the original scene are identified with object recognition and aligned with the output disparity map; pixels at object edges in the disparity map are enhanced and pixels in smear regions are set to the average of the neighboring non-edge pixels, eliminating smear to some extent and improving disparity quality. The method can generate high-quality dense depth maps, effectively alleviates artifacts caused by occlusion in monocular scene depth prediction, and can meet the 2D-to-3D conversion requirements of many indoor and outdoor real scenes.
Compared with the prior art, the invention has the following advantages: 1) the convolution module is improved on top of training a scene depth prediction model with binocular image pairs; exploiting the dense linking of feature maps in the DenseNet module, each layer receives the features of all preceding layers and passes its own features to all subsequent layers, reducing information loss during transmission and improving depth prediction accuracy; 2) in the binocular matching loss, a reconstruction technique adds a left-right disparity consistency constraint on top of the binocular RGB structural-similarity constraint, making full use of binocular images as training data; 3) in the disparity smoothness loss, a smoothing operation on the disparity map yields a smooth depth map, while the edge-aware term and the adaptive support-region method preserve image and occlusion edge information, giving a sharper depth map; 4) the edges of the output disparity map are optimized: mirror-flipping the input image recovers more information at the right edge of the image, effectively alleviating the occlusion problem; 5) smear in the disparity map is effectively removed: detecting objects in the original image and aligning them with the disparity map enhances pixel values at object edges and improves the disparity prediction accuracy for individual objects.
Drawings
Figure 1 is an overall flow chart of the present invention,
figure 2 is a schematic diagram of the loss of binocular matching,
fig. 3 is a schematic diagram of the cross-based adaptive support region.
Detailed Description
The invention is explained in detail below with reference to the drawings, and the specific steps are as follows.
Example 1: as shown in fig. 1, a monocular scene depth prediction method based on deep learning takes a calibrated color image pair as training input, uses a DenseNet convolution module to improve the encoder part of the network architecture, strengthens loss constraints at the levels of binocular stereo matching and image smoothing, and alleviates the occlusion problem through post-processing, improving monocular depth prediction overall. It comprises the following steps:

Step 1: preprocessing. The high-resolution binocular color image pairs are resized to 256x512, and the uniformly sized pairs undergo several combined data-augmentation transforms (random flipping and contrast changes) to increase the amount of input data before being fed to the encoder of the convolutional network.

Step 2: the encoder part of the network extracts visual features with a DenseNet-based convolution module; dense connections improve the flow of information and gradients through the network, alleviate gradient vanishing, and strengthen feature propagation.

Step 3: skip connections are set in the decoder part, splicing some feature maps from the encoding stage directly into the decoding stage; upsampling is realized with 64 convolution kernels of size 7x7, and a sigmoid activation function generates the disparity.

Step 4: the binocular matching loss and the depth smoothing loss are strengthened, optimizing the model over the iterations, improving prediction accuracy, and smoothing the depth map while preserving its edges.

Step 5: the post-processing part is optimized. On one hand, because of stereo occlusion the left view sees more content on its left side and information on the right side of objects is easily lost; the input image is therefore flipped and a corresponding disparity map is generated, which is combined with the disparity map of the original image, appropriate edge information is selected, and the output disparity is optimized to alleviate occlusion at the image edges. On the other hand, based on object detection, the output disparity map is corrected with the original image, highlighting object edges in the scene and effectively removing smear.
Step 2 proceeds as follows. The training data set of the depth prediction model consists of calibrated binocular color image pairs, resized to 256x512 as network input. One convolution with 64 kernels of size 7x7 followed by a max pooling yields a tensor of size 64x128x64, which then enters four modules each consisting of a dense block and a transition layer.

The four dense blocks contain 2, 6, 12, and 24 layers respectively, progressively deepening the network. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 and the default bn_size is 4; in each dense block a 1x1 convolution is added before the 3x3 convolution at the bottleneck layer to reduce the number of network parameters. Between every two dense blocks, a transition layer integrates global information using average pooling and improves the compactness of the model. Dense connections improve the flow of information and gradients through the network, allow it to be deepened, alleviate gradient vanishing, and strengthen feature propagation.

Step 4 proceeds as follows: to optimize the model and obtain a more accurate dense depth map, a binocular matching loss and a depth smoothing loss are added, strengthening the constraint of the loss function on the network:
(4.1) Binocular matching loss: matching cost computation is an important metric of stereo matching algorithms; comparing the similarity of the reconstructed image to the corresponding sampled view, using the correlation between the pixels of the binocular image pair, strengthens stereo matching.
Under spatial similarity, strong correlations exist between the pixels of RGB images. Let the original left image be $I^l_{ij}$, where $i, j$ are the pixel position coordinates. A reconstructed left image $\tilde{I}^l_{ij}$ is obtained by a warping operation from the predicted disparity and the original right image: for each pixel of the left image, the corresponding pixel in the right image is found from its disparity value and then interpolated. A combination of an L1 term and a single-scale SSIM term serves as the photometric cost $C_{ap}$ of image reconstruction, measuring the matching cost between the reconstructed view $\tilde{I}^l$ and the real input view $I^l$:

$$C_{ap} = \frac{1}{N} \sum_{i,j} \alpha \, \frac{1 - \mathrm{SSIM}\big(I^l_{ij}, \tilde{I}^l_{ij}\big)}{2} + (1 - \alpha) \, \big\lVert I^l_{ij} - \tilde{I}^l_{ij} \big\rVert_1$$
At the disparity level, the invention tries to make the left-viewpoint disparity map equal to the projected right-viewpoint disparity map. Concretely, $d^r$, which takes the right image as reference, serves as the input image of the reconstruction operation, and $d^l$, which takes the left image as reference, serves as the input disparity map; the warping operation then yields the reconstruction $\tilde{d}^l$ of $d^l$. The cost $C_{lr}$ encourages the predicted left and right disparities to be consistent:

$$C_{lr} = \frac{1}{N} \sum_{i,j} \big| d^l_{ij} - d^r_{ij + d^l_{ij}} \big|$$
(4.2) Depth smoothing loss: where an edge exists in an image there is necessarily a large gradient value; conversely, in smooth parts of the image the gray values change little and the corresponding gradient is small. The gray values at the contour edges of a directly predicted disparity (depth) map change sharply and the layering is strong, so depth discontinuities often appear at image gradients; when occlusion is involved, the necessary object boundaries must also be preserved;
Since a dense disparity map is required, the disparity should stay locally smooth, so an L1 penalty is applied to the disparity gradient $\partial d$. However, this assumption wrongly penalizes edges and does not hold at object boundaries, which typically correspond to regions of high pixel-intensity variation. An edge-aware term $e^{-\lVert \partial I \rVert}$ built from the image gradient $\partial I$ is therefore introduced to reduce the effect of the erroneous penalty:

$$C_{ds} = \frac{1}{N} \sum_{i,j} \big| \partial_x d_{ij} \big| \, e^{-\lVert \partial_x I_{ij} \rVert} + \big| \partial_y d_{ij} \big| \, e^{-\lVert \partial_y I_{ij} \rVert}$$
On the basis of the edge-aware term $e$, a cross-based adaptive support-region method is added. The idea is to limit the support region of a pixel $p$ and reduce the effect of wrongly penalizing edges. From $p$, four arms extend up, down, left, and right, and every point $q$ satisfying the conditions is admitted into the support region $U_p$ of $p$; the iteration continues until the conditions below fail (that is, the intensity difference must not be too large and the distance must not be too far):

$$U_p = \big\{\, q \;:\; \lvert I(q) - I(p) \rvert \le \tau, \;\; \lVert q - p \rVert_1 \le L \,\big\}$$
In this scheme, the binocular matching loss strengthens the matching loss between binocular image pairs with a reconstruction technique, adding a disparity-level constraint on top of the RGB level: warping the original right-viewpoint RGB image with the predicted disparity map yields the corresponding reconstructed left image, and comparing the reconstructed left RGB image and disparity map against the original left image and predicted disparity map gives the binocular matching loss.

In the depth smoothing loss, the method smooths the depth map while preserving depth changes at object edges and occlusions. A smoothing operation is applied to the disparity map; for object boundaries in regions of high pixel-intensity variation, the edge-aware term $e$ reduces the effect of erroneous penalties, and the cross-based adaptive support-region method limits the support region $U_p$ of a pixel $p$ so that every pixel in $U_p$ differs from $p$ neither too much in intensity nor too far in distance, iterating until those conditions fail.
Step (5) proceeds as follows, improving the generated disparity map. To alleviate occlusion at image edges, for an input picture $I$ not only its disparity map $D_l$ is computed, but also the disparity map $D_l'$ of the mirror-flipped image $I'$; flipping that disparity map back yields $D_l''$, aligned with $D_l$. Combining the left 5% of $D_l''$, the right 5% of $D_l$, and the average of the two in between forms the final result, reducing the effect of stereo occlusion at the image edges. At the same time, smear at object edges is removed: objects that may exist in the original scene are identified with object recognition and aligned with the output disparity map; pixels at object edges in the disparity map are enhanced, and pixels in smear regions are set to the average of the neighboring non-edge pixels, highlighting objects in the scene and improving the quality of the disparity map.
Because moving objects may exist in the scene, blurred regions appear in the input image. To highlight the objects in the scene, object recognition identifies the objects that may exist in the original image and aligns them with the output disparity map; pixels at object edges in the disparity map are enhanced and pixels in the smear region are set to the average of the neighboring non-edge pixels, eliminating smear to some extent and improving the quality of the disparity map.
The application example is as follows.

As shown in fig. 1, the present invention mainly includes three parts: improvement of the encoder module of the disparity prediction network, enhancement of the loss-function algorithm, and optimization of the post-processing. Each part is described in detail below.

1. Depth prediction network structure based on the DenseNet module
The network takes the resized 256x512 binocular RGB image pairs as input. In the encoder part of the network, one convolution (conv) with 64 kernels of size 7x7 followed by max pooling (maxpool) reduces the image size to 1/4, with 64 channels.
The data then enters four modules each consisting of a dense block and a transition layer. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 and the default bn_size is 4; the bottleneck layer of each dense layer adds a Conv1x1 that outputs 128 channels (bn_size x growth_rate), and the Conv3x3 outputs 32 channels (growth_rate). The transition layer integrates global information with average pooling and sets the number of output channels to half the number of input channels.
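The following sketch assembles an encoder under the stated hyperparameters (initial 7x7 conv with 64 kernels, dense blocks of 2/6/12/24 layers, growth_rate 32, bn_size 4, transitions with average pooling that halve the channels). It is a minimal PyTorch rendering of the description above, not the patent's actual code; the BN/ReLU placements follow standard DenseNet and are assumptions.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """BN-ReLU-Conv1x1 bottleneck (bn_size*growth_rate channels), then
    BN-ReLU-Conv3x3 (growth_rate channels); the output is concatenated
    onto the input, realizing the dense connectivity."""
    def __init__(self, in_ch, growth_rate=32, bn_size=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, bn_size * growth_rate, 1, bias=False),   # 128 channels
            nn.BatchNorm2d(bn_size * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1, bias=False),  # 32
        )

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)

def make_encoder(block_layers=(2, 6, 12, 24), growth_rate=32):
    # Initial 7x7 conv (64 kernels) + max pooling: 256x512 -> 64x128, 64 channels.
    layers = [nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
              nn.BatchNorm2d(64), nn.ReLU(inplace=True),
              nn.MaxPool2d(3, stride=2, padding=1)]
    ch = 64
    for n in block_layers:
        for _ in range(n):                      # one dense block
            layers.append(DenseLayer(ch, growth_rate))
            ch += growth_rate
        # Transition: halve the channels, average-pool to integrate global info.
        layers += [nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                   nn.Conv2d(ch, ch // 2, 1, bias=False),
                   nn.AvgPool2d(2, stride=2)]
        ch //= 2
    return nn.Sequential(*layers)
```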
2. Algorithmic optimization of the binocular matching loss and the depth smoothing loss
(1) Binocular matching loss computation is an important metric for stereo matching algorithms. Based on a reconstruction technique, image reconstruction is applied at the RGB and disparity levels, the similarity of the reconstructed image to the corresponding sampled view is compared, and the computed costs are aggregated.

A combination of an L1 term and a single-scale SSIM term serves as the photometric cost $C_{ap}$ of image reconstruction, encouraging the reconstructed image to be visually similar to the training input. Let the original left image be $I^l_{ij}$, where $i, j$ are the pixel position coordinates; a reconstructed left image $\tilde{I}^l_{ij}$ is obtained by a warping operation from the predicted disparity and the original right image, looking up, for each pixel of the left image, the corresponding pixel in the right image according to its disparity value and then interpolating. $C_{ap}$ is obtained by comparing the input image $I^l$ with the reconstructed image $\tilde{I}^l$; a simplified SSIM with a 3x3 block filter is used instead of a Gaussian filter, and $N$ is the number of pixels:

$$C_{ap} = \frac{1}{N} \sum_{i,j} \alpha \, \frac{1 - \mathrm{SSIM}\big(I^l_{ij}, \tilde{I}^l_{ij}\big)}{2} + (1 - \alpha) \, \big\lVert I^l_{ij} - \tilde{I}^l_{ij} \big\rVert_1$$
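A sketch of $C_{ap}$ as described, with the simplified single-scale SSIM using a 3x3 block (mean) filter. The weighting $\alpha$ is not given in the patent; 0.85 is an assumed value.

```python
import torch
import torch.nn.functional as F

def ssim_3x3(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified single-scale SSIM with a 3x3 block (mean) filter."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=0)
    mu_y = F.avg_pool2d(y, 3, 1, padding=0)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=0) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=0) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=0) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_cost(img, img_rec, alpha=0.85):  # alpha is an assumption
    """C_ap: alpha * (1 - SSIM)/2 + (1 - alpha) * L1, averaged over pixels."""
    ssim_term = (1.0 - ssim_3x3(img, img_rec)) / 2.0
    l1_term = (img - img_rec).abs()
    # The SSIM map shrinks by the filter margin; crop the L1 map to match.
    l1_term = l1_term[..., 1:-1, 1:-1]
    return (alpha * ssim_term + (1.0 - alpha) * l1_term).mean()
```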
To generate more accurate disparity maps, the network is trained to predict the disparities of both the left and right images while only the left view is fed to the convolutional part of the network; it outputs both the disparity map $d^l$ referenced to the left image and the disparity map $d^r$ referenced to the right image. To keep them consistent, an L1 left-right disparity consistency loss is introduced as part of the model. This cost tries to make the left-view disparity map equal to the projected right-view disparity map: concretely, $d^r$ (referenced to the right image) serves as the input image of the reconstruction operation and $d^l$ (referenced to the left image) serves as the input disparity map, and the warping operation yields the reconstruction $\tilde{d}^l$ of $d^l$. Note that what is obtained here is a reconstructed disparity map, not a reconstructed left image, so the left-right disparity consistency loss $C_{lr}$, which encourages the predicted left and right disparities to agree, can be written as

$$C_{lr} = \frac{1}{N} \sum_{i,j} \big| d^l_{ij} - d^r_{ij + d^l_{ij}} \big|$$

where $N$ is the number of pixels and $d^r_{ij + d^l_{ij}}$ is the reconstructed $\tilde{d}^l_{ij}$. As with all other terms, this cost is also applied to the disparity map of the right viewpoint and is predicted at all output scales.
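A sketch of $C_{lr}$, with a bilinear warp standing in for the reconstruction operation; the disparity sign convention and the use of `grid_sample` are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_by_disparity(src, disp):
    """Sample src (B,1,H,W) at columns shifted by disp (B,1,H,W).

    Disparities are taken in pixels with a rightward shift, matching the
    index j + d_l(i,j) in the formula; this sign convention is an assumption.
    """
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.to(src) + disp.squeeze(1)                 # (B, H, W) shifted columns
    ys = ys.to(src).expand(b, -1, -1)
    grid = torch.stack([2 * xs / (w - 1) - 1,         # normalize to [-1, 1]
                        2 * ys / (h - 1) - 1], dim=-1)
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def lr_consistency_cost(disp_l, disp_r):
    """C_lr = mean over pixels of | d_l(i,j) - d_r(i, j + d_l(i,j)) |."""
    disp_l_rec = warp_by_disparity(disp_r, disp_l)    # reconstructed d_l
    return (disp_l - disp_l_rec).abs().mean()
```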
(2) The gray values at the contour edges of the predicted disparity map change sharply, the layering is strong, and depth discontinuities often appear at image gradients. Since dense disparity maps are required here, the disparity must not only stay locally smooth; information at object edges and occlusions must also be preserved.
To keep the disparity locally smooth, an L1 penalty is imposed on the disparity gradient $\partial d$. However, this assumption wrongly penalizes edges and does not hold at object boundaries, which typically correspond to regions of high pixel-intensity variation. An edge-aware term $e^{-\lVert \partial I \rVert}$ built from the image gradient $\partial I$ is therefore introduced to reduce the effect of the erroneous penalty. The disparity smoothness loss $C_{ds}$ is defined as

$$C_{ds} = \frac{1}{N} \sum_{i,j} \big| \partial_x d_{ij} \big| \, e^{-\lVert \partial_x I_{ij} \rVert} + \big| \partial_y d_{ij} \big| \, e^{-\lVert \partial_y I_{ij} \rVert}$$

using the disparity gradients $\partial_x d, \partial_y d$ and the image gradients $\partial_x I, \partial_y I$ in the x- and y-directions respectively.
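A sketch of $C_{ds}$ with forward differences standing in for the gradients; the discretization is an assumption.

```python
import torch

def smoothness_cost(disp, img):
    """C_ds: L1 disparity gradients weighted by exp(-|image gradient|).

    disp: (B, 1, H, W), img: (B, 3, H, W). Forward differences are used
    for both gradients; this choice is an assumption.
    """
    dx_d = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy_d = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    # Edge-aware weights from the mean absolute image gradient over channels.
    wx = torch.exp(-(img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True))
    wy = torch.exp(-(img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True))
    return (dx_d * wx).mean() + (dy_d * wy).mean()
```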
On the basis of introducing the edge perception item e, a method for finding a field based on cross self-adaptation is added, the idea is to limit the support field of a certain pixel point p and reduce the influence brought by the punishment of errors on the edge. For the point p, four arms respectively extend from the upper, lower, left and right sides, and the point meeting the conditions is accommodated in the supporting area U of the point ppThe iteration is continued until the termination when the following conditions are not met. (i.e. the strength must not exceed too much, the distance must not be too far)
Figure BDA00025276288000000911
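A sketch of the cross-based support region: four arms grow from $p$ until either condition fails. The thresholds `tau` and `max_arm` are illustrative, since the patent states the conditions only qualitatively.

```python
import numpy as np

def support_region(gray, p, tau=20.0, max_arm=17):
    """Cross-based support region U_p of pixel p = (row, col).

    gray: 2-D intensity image (numpy array). A neighbor is admitted while
    its intensity stays within tau of gray[p] and its distance from p is
    at most max_arm; both thresholds are assumptions.
    """
    h, w = gray.shape
    r0, c0 = p
    region = {(r0, c0)}
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # up, down, left, right
        r, c = r0, c0
        for _ in range(max_arm):
            r, c = r + dr, c + dc
            if not (0 <= r < h and 0 <= c < w):
                break                                   # left the image
            if abs(float(gray[r, c]) - float(gray[r0, c0])) > tau:
                break                                   # intensity differs too much
            region.add((r, c))
    return region
```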
3. The post-processing part alleviates occlusion at the image edges, effectively removes the smear of moving objects, and optimizes the output disparity map.
Because of stereo occlusion, the left view sees more content on its left side and information on the right side of objects is easily lost. Therefore, for an input picture $I$, not only its disparity map $D_l$ is computed, but also the disparity map $D_l'$ of the mirror-flipped image $I'$; appropriate edge information is selected from the two disparity maps, and the output disparity is optimized to reduce the effect of stereo occlusion at the image edges.
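A sketch of this flip-based post-processing; `predict_disparity` is a placeholder for the trained network, and hard band boundaries are used where a smooth blend would also be possible.

```python
import numpy as np

def postprocess_disparity(img, predict_disparity):
    """Combine D_l with the back-flipped disparity of the mirrored image.

    img: (H, W, 3) array; predict_disparity: placeholder for the trained
    network, mapping an image to an (H, W) disparity map.
    """
    d_l = predict_disparity(img)                         # D_l
    d_flip = predict_disparity(np.fliplr(img).copy())    # D_l' of mirror image I'
    d_lpp = np.fliplr(d_flip)                            # D_l'', aligned with D_l
    h, w = d_l.shape
    band = int(round(0.05 * w))
    out = 0.5 * (d_l + d_lpp)                            # average in the middle
    out[:, :band] = d_lpp[:, :band]                      # left 5% from D_l''
    out[:, w - band:] = d_l[:, w - band:]                # right 5% from D_l
    return out
```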
Because moving objects may exist in the scene, blurred regions appear in the input image. To highlight the objects in the scene, object recognition identifies the objects that may exist in the original image and aligns them with the output disparity map; pixels at object edges in the disparity map are enhanced and pixels in the smear region are set to the average of the neighboring non-edge pixels, eliminating smear to some extent and improving the quality of the disparity map.
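A sketch of the smear correction under loose assumptions: `boxes` stands in for the output of an unspecified object detector run on the original image, the smear mask is an illustrative choice (the patent does not define it precisely), and the enhancement parameters are made up for illustration.

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def correct_smear(disp, boxes, edge_thresh=2.0, boost=0.5):
    """Sharpen object-edge disparities and fill smear with a local mean.

    disp: (H, W) float disparity map; boxes: iterable of (r0, c0, r1, c1)
    rectangles from an object detector (a placeholder here).
    """
    out = disp.copy()
    local_mean = uniform_filter(disp, size=7)             # neighborhood average
    grad = np.hypot(sobel(disp, axis=0), sobel(disp, axis=1))
    for r0, c0, r1, c1 in boxes:
        roi = (slice(r0, r1), slice(c0, c1))
        edges = grad[roi] > edge_thresh
        # Unsharp-mask edge pixels so the object boundary stands out.
        sharpened = disp[roi] + boost * (disp[roi] - local_mean[roi])
        out[roi][edges] = sharpened[edges]
        # Treat weaker-gradient pixels near the boundary as smear and
        # replace them with the neighborhood mean.
        smear = (grad[roi] > 0.3 * edge_thresh) & ~edges
        out[roi][smear] = local_mean[roi][smear]
    return out
```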
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and all equivalent substitutions or substitutions made on the above-mentioned technical solutions belong to the scope of the present invention.

Claims (4)

1. A monocular scene depth prediction method based on deep learning, characterized by comprising the following steps:

Step 1: preprocessing: the high-resolution binocular color image pairs are resized to 256x512, the uniformly sized pairs undergo several combined data-augmentation transforms (random flipping and contrast changes) to increase the amount of input data, and the data are then fed to the encoder of a convolutional network;

Step 2: the encoder part of the network extracts visual features with a DenseNet-based convolution module; dense connections improve the flow of information and gradients through the network, alleviate gradient vanishing, and strengthen feature propagation;

Step 3: skip connections are set in the decoder part, splicing some feature maps from the encoding stage directly into the decoding stage; upsampling is realized with 64 convolution kernels of size 7x7, and a sigmoid activation function generates the disparity;

Step 4: the binocular matching loss and the depth smoothing loss are strengthened, optimizing the model over the iterations, improving prediction accuracy, and smoothing the depth map while preserving its edges;

Step 5: the post-processing part is optimized: the input image is flipped to generate a corresponding disparity map, which is combined with the disparity map of the original image to alleviate edge occlusion; based on object detection, the original image is aligned with the disparity map and the pixels at object edges are enhanced, eliminating smear to some extent.
2. The method for predicting the depth of a monocular scene based on deep learning according to claim 1, wherein step 2 is as follows: the training data set of the depth prediction model consists of calibrated binocular color image pairs, resized to 256x512 as network input; one convolution with 64 kernels of size 7x7 followed by a max pooling yields a tensor of size 64x128x64, which then enters four modules each consisting of a dense block and a transition layer;
the four dense blocks contain 2, 6, 12, and 24 layers respectively, progressively deepening the network; the growth rate (growth_rate) of the dense layers in all dense blocks is set to 32, the default bn_size is 4, and in each dense block the bottleneck layer adds a 1x1 convolution before the 3x3 convolution.
3. The method for predicting the depth of a monocular scene based on deep learning according to claim 1, wherein step 4 is specifically as follows:

(3.1) binocular matching loss: matching cost computation is an important metric of stereo matching algorithms, using the correlation between the pixels of a binocular image pair;
under spatial similarity, strong correlations exist between the pixels of RGB images; let the original left image be $I^l_{ij}$, where $i, j$ are the pixel position coordinates; a reconstructed left image $\tilde{I}^l_{ij}$ is obtained by a warping operation from the predicted disparity and the original right image, looking up, for each pixel of the left image, the corresponding pixel in the right image according to its disparity value and then interpolating; a combination of an L1 term and a single-scale SSIM term serves as the photometric cost $C_{ap}$ of image reconstruction, measuring the matching cost between the reconstructed view $\tilde{I}^l$ and the real input view $I^l$:

$$C_{ap} = \frac{1}{N} \sum_{i,j} \alpha \, \frac{1 - \mathrm{SSIM}\big(I^l_{ij}, \tilde{I}^l_{ij}\big)}{2} + (1 - \alpha) \, \big\lVert I^l_{ij} - \tilde{I}^l_{ij} \big\rVert_1$$
at the disparity level, the left-viewpoint disparity map is made equal to the projected right-viewpoint disparity map: $d^r$, taking the right image as reference, serves as the input image of the reconstruction operation, and $d^l$, taking the left image as reference, serves as the input disparity map; the warping operation yields the reconstruction $\tilde{d}^l$ of $d^l$, and $C_{lr}$ encourages the predicted left and right disparities to be consistent:

$$C_{lr} = \frac{1}{N} \sum_{i,j} \big| d^l_{ij} - d^r_{ij + d^l_{ij}} \big|$$
(3.2) depth smoothing loss: an edge-aware term $e^{-\lVert \partial I \rVert}$ built from the image gradient $\partial I$ reduces the effect of erroneous penalties:

$$C_{ds} = \frac{1}{N} \sum_{i,j} \big| \partial_x d_{ij} \big| \, e^{-\lVert \partial_x I_{ij} \rVert} + \big| \partial_y d_{ij} \big| \, e^{-\lVert \partial_y I_{ij} \rVert}$$
on the basis of the edge-aware term $e$, a cross-based adaptive support-region method is added: from a point $p$, four arms extend up, down, left, and right, and each point $q$ satisfying the conditions is admitted into the support region $U_p$ of $p$, iterating until the conditions below fail:

$$U_p = \big\{\, q \;:\; \lvert I(q) - I(p) \rvert \le \tau, \;\; \lVert q - p \rVert_1 \le L \,\big\}$$
4. the method of claim 1, wherein the step (5) is specifically to improve the generated disparity map, and to reduce the occlusion of the image edge, not only the disparity map D of the input picture I is calculatedlAnd calculating a disparity map D for the mirror-inverted image I' of the picture Il' and then inverting the disparity map to obtain a disparity map Dl″,Dl"and DlAligning; binding of Dl"left 5% and DlThe right 5% of the image and the average between the two constitutes the final result to reduce the effect of stereo occlusion at the edges of the image.
CN202010508803.3A 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning Active CN111899295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010508803.3A CN111899295B (en) 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010508803.3A CN111899295B (en) 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning

Publications (2)

Publication Number Publication Date
CN111899295A true CN111899295A (en) 2020-11-06
CN111899295B CN111899295B (en) 2022-11-15

Family

ID=73208030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010508803.3A Active CN111899295B (en) 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning

Country Status (1)

Country Link
CN (1) CN111899295B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561979A (en) * 2020-12-25 2021-03-26 天津大学 Self-supervision monocular depth estimation method based on deep learning
CN113140011A (en) * 2021-05-18 2021-07-20 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related assembly
CN114022529A (en) * 2021-10-12 2022-02-08 东北大学 Depth perception method and device based on self-adaptive binocular structured light
CN114119698A (en) * 2021-06-18 2022-03-01 湖南大学 Unsupervised monocular depth estimation method based on attention mechanism
CN115184016A (en) * 2022-09-06 2022-10-14 江苏东控自动化科技有限公司 Elevator bearing fault detection method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110310317A (en) * 2019-06-28 2019-10-08 西北工业大学 A method of the monocular vision scene depth estimation based on deep learning
CN110490919A (en) * 2019-07-05 2019-11-22 天津大学 A kind of depth estimation method of the monocular vision based on deep neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110310317A (en) * 2019-06-28 2019-10-08 西北工业大学 A method of the monocular vision scene depth estimation based on deep learning
CN110490919A (en) * 2019-07-05 2019-11-22 天津大学 A kind of depth estimation method of the monocular vision based on deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhou Yuncheng et al., "Depth estimation method for tomato plant images based on self-supervised learning", Transactions of the Chinese Society of Agricultural Engineering *
Huang Jun et al., "A survey of progress in monocular depth estimation", Journal of Image and Graphics *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561979A (en) * 2020-12-25 2021-03-26 天津大学 Self-supervision monocular depth estimation method based on deep learning
CN113140011A (en) * 2021-05-18 2021-07-20 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related assembly
CN113140011B (en) * 2021-05-18 2022-09-06 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related components
CN114119698A (en) * 2021-06-18 2022-03-01 湖南大学 Unsupervised monocular depth estimation method based on attention mechanism
CN114119698B (en) * 2021-06-18 2022-07-19 湖南大学 Unsupervised monocular depth estimation method based on attention mechanism
CN114022529A (en) * 2021-10-12 2022-02-08 东北大学 Depth perception method and device based on self-adaptive binocular structured light
CN114022529B (en) * 2021-10-12 2024-04-16 东北大学 Depth perception method and device based on self-adaptive binocular structured light
CN115184016A (en) * 2022-09-06 2022-10-14 江苏东控自动化科技有限公司 Elevator bearing fault detection method

Also Published As

Publication number Publication date
CN111899295B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN111899295B (en) Monocular scene depth prediction method based on deep learning
CN110738697B (en) Monocular depth estimation method based on deep learning
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN108648161B (en) Binocular vision obstacle detection system and method of asymmetric kernel convolution neural network
Lee et al. Local disparity estimation with three-moded cross census and advanced support weight
CN106408513B (en) Depth map super resolution ratio reconstruction method
CN115205489A (en) Three-dimensional reconstruction method, system and device in large scene
CN104954780A (en) DIBR (depth image-based rendering) virtual image restoration method applicable to high-definition 2D/3D (two-dimensional/three-dimensional) conversion
CN115035171B (en) Self-supervision monocular depth estimation method based on self-attention guide feature fusion
CN113610912B (en) System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction
Liu et al. High quality depth map estimation of object surface from light-field images
CN109741358B (en) Superpixel segmentation method based on adaptive hypergraph learning
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN111369435B (en) Color image depth up-sampling method and system based on self-adaptive stable model
CN113421210A (en) Surface point cloud reconstruction method based on binocular stereo vision
CN115115860A (en) Image feature point detection matching network based on deep learning
Zhu et al. Hybrid scheme for accurate stereo matching
Pan et al. An automatic 2D to 3D video conversion approach based on RGB-D images
CN115272450A (en) Target positioning method based on panoramic segmentation
CN107194931A (en) It is a kind of that the method and system for obtaining target depth information is matched based on binocular image
CN111985535A (en) Method and device for optimizing human body depth map through neural network
Hou et al. De‐NeRF: Ultra‐high‐definition NeRF with deformable net alignment
CN115937011B (en) Key frame pose optimization visual SLAM method, storage medium and equipment based on time lag feature regression
CN116129036B (en) Depth information guided omnidirectional image three-dimensional structure automatic recovery method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant