
CN111899295B - Monocular scene depth prediction method based on deep learning - Google Patents

Monocular scene depth prediction method based on deep learning

Info

Publication number
CN111899295B
CN111899295B (Application CN202010508803.3A)
Authority
CN
China
Prior art keywords
image
parallax
depth
disparity map
edge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010508803.3A
Other languages
Chinese (zh)
Other versions
CN111899295A (en)
Inventor
姚莉
缪静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202010508803.3A
Publication of CN111899295A
Application granted
Publication of CN111899295B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/13 Edge detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

A monocular scene depth prediction method based on deep learning, applicable to monocular pictures or videos. A calibrated binocular color image pair is used to train the depth prediction model. The network architecture extracts the feature space with a DenseNet convolution module, whose dense blocks and transition layers connect each layer in the network directly to the preceding layers so that features are reused. The binocular matching loss is improved: the depth prediction problem is treated as an image reconstruction problem, the color image and disparity map of the input left viewpoint are sampled to generate a virtual color image and disparity map, and a stereo matching algorithm for binocular image pairs constrains the consistency of the generated virtual view with the corresponding input right-viewpoint image at both the RGB level and the disparity level, yielding better depth. The depth smoothing loss is also improved. The method can generate high-quality dense depth maps, effectively alleviates the artifact problem caused by occlusion in monocular scene depth prediction, and can meet the 2D-to-3D conversion requirements of many indoor and outdoor real scenes.

Description

Monocular scene depth prediction method based on deep learning
Technical Field
The invention relates to a monocular scene depth prediction method based on deep learning, and belongs to the field of computer vision and image processing.
Background
Monocular depth prediction is a research topic of great interest in computer vision, with wide application value in fields such as autonomous driving, VR game production, and film production. However, many problems in this field remain to be solved, for example:
1) Collecting depth data with lidar consumes a great deal of energy and is strongly affected by weather;
2) The generated depth image contains artifacts and smear caused by illumination shadows or occlusion in the original image;
3) Methods that recover depth information from sparse depth maps suffer from discontinuous depth at edges;
4) The depth model is not fully differentiable, so gradients become incomputable during optimization and training is suboptimal;
5) The image generation model cannot scale to large output resolutions;
6) The generalization ability of the model is generally limited by the training data.
Disclosure of Invention
To solve these problems, the invention provides a monocular scene depth prediction method based on deep learning. The method is applicable to monocular pictures or videos, obtains a dense depth map of the scene image with an accuracy of up to 91%, and can meet the depth prediction requirements of many indoor and outdoor real scenes.
To achieve this, the technical scheme of the invention is as follows: a monocular scene depth prediction method based on deep learning takes a calibrated color image pair as training input, improves the network architecture of the encoder part with a DenseNet convolution module, strengthens loss constraints at several levels (binocular stereo matching and image smoothing), and alleviates the occlusion problem with post-processing, improving the monocular image depth prediction effect overall. The method comprises the following steps:
Step 1: Preprocessing. Resize the high-resolution binocular color image pair to 256x512, apply several combined data-enhancement transforms (random flipping and contrast transformation) to the uniformly sized image pair to increase the amount of input data, and then feed the result into the encoder of a convolutional network (a preprocessing sketch is given after the step list).
Step 2: The encoder part of the network extracts visual features with a DenseNet-based convolution module; dense connections improve the propagation of information and gradients through the network, alleviating the vanishing-gradient problem and strengthening feature propagation;
Step 3: In the decoder part, set skip connections that splice part of the feature maps from the encoding process directly into the decoding process, use 64 7x7 convolution kernels for up-sampling, and use a sigmoid function as the activation function to generate the disparity;
Step 4: Strengthen the binocular matching loss and the depth smoothing loss to optimize the model over iterations and improve prediction accuracy, smoothing the depth map while preserving its edges;
Step 5: Optimize the post-processing part. On one hand, because of stereo occlusion the left view sees more content on its left side and information on the right part of an object is easily lost; the input image is therefore flipped and a corresponding disparity map is generated, which is combined with the disparity map of the original image, selecting the appropriate edge information to optimize the output disparity and alleviate occlusion at the image edges. On the other hand, based on object detection, the output disparity map is corrected with the original image, highlighting object edges in the scene and effectively eliminating smear.
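A minimal sketch of the preprocessing of step 1 in PyTorch, assuming torchvision for the contrast transform; the 256x512 target size, random flipping, and contrast transformation come from the text, while p_flip, contrast_range, and the swap of views on flipping are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import adjust_contrast

def preprocess_pair(left, right, p_flip=0.5, contrast_range=(0.8, 1.2)):
    """Resize a calibrated binocular pair to 256x512 and apply paired
    augmentations. left, right: (3, H, W) float tensors in [0, 1].
    p_flip and contrast_range are illustrative values, not from the patent."""
    size = (256, 512)
    left = F.interpolate(left.unsqueeze(0), size=size, mode="bilinear",
                         align_corners=False).squeeze(0)
    right = F.interpolate(right.unsqueeze(0), size=size, mode="bilinear",
                          align_corners=False).squeeze(0)

    # Random horizontal flip. Mirroring swaps the roles of the two views,
    # so they are exchanged as well (an assumed detail for stereo pairs).
    if torch.rand(1).item() < p_flip:
        left, right = torch.flip(right, dims=[2]), torch.flip(left, dims=[2])

    # Random contrast change applied identically to both views.
    c = float(torch.empty(1).uniform_(*contrast_range))
    left, right = adjust_contrast(left, c), adjust_contrast(right, c)
    return left, right
```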
As an improvement of the present invention, step 2 is specifically as follows. The training data set of the depth prediction model is a calibrated binocular color image pair, resized to 256x512 as the input of the network. After one convolution with 64 7x7 kernels and one max pooling, a tensor at 1/4 of the input resolution with 64 channels is obtained, which then enters four modules each consisting of a dense block and a transition layer (a sketch of this encoder follows).
The four dense blocks contain 2, 6, 12, and 24 layers respectively, so the network grows steadily deeper. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 and the default bn_size is 4. In the bottleneck layers of the dense blocks, a 1x1 convolution is added before the 3x3 convolution to reduce the number of network parameters. A transition layer is placed between every two dense blocks; it integrates global information with average pooling and improves the compactness of the model. Dense connections improve the propagation of information and gradients through the network, alleviating the vanishing-gradient problem while deepening the network and strengthening feature propagation.
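A minimal sketch of this encoder in PyTorch, under the stated configuration (stem of 64 7x7 kernels plus max pooling, dense blocks of 2, 6, 12, and 24 layers, growth_rate 32, bn_size 4, transitions halving the channels with average pooling); the module wiring beyond these figures is an illustrative assumption, not the patent's exact network:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Bottleneck dense layer: 1x1 conv to bn_size*growth_rate channels,
    then 3x3 conv producing growth_rate new feature channels."""
    def __init__(self, in_ch, growth_rate=32, bn_size=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, bn_size * growth_rate, 1, bias=False),
            nn.BatchNorm2d(bn_size * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # Dense connectivity: concatenate the new features onto the input.
        return torch.cat([x, self.body(x)], dim=1)

def dense_block(in_ch, n_layers, growth_rate=32):
    layers, ch = [], in_ch
    for _ in range(n_layers):
        layers.append(DenseLayer(ch, growth_rate))
        ch += growth_rate
    return nn.Sequential(*layers), ch

class Encoder(nn.Module):
    """DenseNet-style encoder: 7x7/64 stem, then 4 dense blocks of 2/6/12/24 layers."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),   # 1/4 resolution, 64 channels
        )
        blocks, ch = [], 64
        for n in (2, 6, 12, 24):
            blk, ch = dense_block(ch, n)
            blocks.append(blk)
            # Transition layer: halve channels, average-pool to integrate
            # global information and keep the model compact.
            blocks.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch // 2, 1, bias=False),
                nn.AvgPool2d(2, stride=2),
            ))
            ch //= 2
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        return self.blocks(self.stem(x))
```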
As an improvement of the present invention, step 4 is specifically as follows. To optimize the model and obtain a more accurate dense depth map, a binocular matching loss and a depth smoothing loss are added, strengthening the constraint that the loss function imposes on the network:
(4.1) Binocular matching loss: matching cost computation is an important metric of stereo matching algorithms; comparing the similarity of the reconstructed view against the sampled view, using the correlation between the pixels of the binocular image pair, strengthens the stereo matching.
Under spatial similarity, strong correlation exists between the pixels of RGB images. Let the original left image be $I^l_{ij}$ (where $i,j$ are the position coordinates of the pixel). From the predicted disparity and the original right image, a reconstructed left image $\tilde{I}^l_{ij}$ is obtained by a warping operation: for each pixel of the left image, the corresponding pixel in the right image is found according to that pixel's disparity value and the result is interpolated. A combination of an L1 term and a single-scale SSIM term, with weighting coefficient $\alpha$, is used as the photometric cost $C_{ap}$ of the image reconstruction, computed between the reconstructed view $\tilde{I}^l$ and the real input view $I^l$:

$$C_{ap} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}(I^l_{ij},\tilde{I}^l_{ij})}{2} + (1-\alpha)\,\big|I^l_{ij}-\tilde{I}^l_{ij}\big|\right]$$

At the disparity level, the invention tries to make the left-viewpoint disparity map equal to the projected right-viewpoint disparity map. Specifically, taking $d^r$ (the disparity map referenced to the right image) as the input image of the reconstruction operation and $d^l$ (referenced to the left image) as the input disparity map, a reconstruction $\tilde{d}^l$ of $d^l$ is obtained through the warping operation. $C_{lr}$ encourages the predicted left and right disparities to be consistent:

$$C_{lr} = \frac{1}{N}\sum_{i,j}\big|d^l_{ij}-\tilde{d}^l_{ij}\big|$$
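A minimal sketch of these two matching-loss terms in PyTorch; the weight alpha is an assumed value in line with common L1+SSIM combinations, and the ssim and warping helpers referenced here are sketched in the application example below:

```python
import torch

def photometric_loss(left, left_rec, alpha=0.85):
    """C_ap: weighted single-scale SSIM plus L1 between the real left view
    and its reconstruction from the right view. alpha is an assumed weight,
    not a value stated in the patent; ssim is the 3x3 block-filter helper
    sketched in the application example."""
    l1 = (left - left_rec).abs().mean()
    dssim = ((1.0 - ssim(left, left_rec)) / 2.0).mean()
    return alpha * dssim + (1.0 - alpha) * l1

def lr_consistency_loss(disp_left, disp_left_rec):
    """C_lr: L1 between the predicted left disparity d_l and its
    reconstruction obtained by warping d_r with d_l (see the warping
    sketch further below)."""
    return (disp_left - disp_left_rec).abs().mean()
```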
(4.2) Depth smoothing loss: where an edge exists in an image there is necessarily a large gradient value; conversely, in a smooth part of the image the gray values change little and the corresponding gradient is small. The gray values at the contour edges of a directly predicted disparity map (depth map) change markedly, producing a strong sense of layering, and depth discontinuities often occur at image gradients; when the occlusion problem arises, the necessary object boundaries must also be preserved.
Since a dense disparity map is required, the disparity should remain locally smooth, so an L1 penalty is applied to the disparity gradients $\partial d$. However, this assumption wrongly penalizes edges and is not applicable to object boundaries, which typically correspond to regions of high variation in pixel intensity. Therefore an edge-awareness term $e^{-\|\partial I\|}$ based on the image gradient $\partial I$ is introduced to reduce the influence of erroneous penalties:

$$C_{ds} = \frac{1}{N}\sum_{i,j}\Big(\big|\partial_x d_{ij}\big|\,e^{-\|\partial_x I_{ij}\|} + \big|\partial_y d_{ij}\big|\,e^{-\|\partial_y I_{ij}\|}\Big)$$

On the basis of the edge-awareness term, a cross-based adaptive support-region method is added. The idea is to limit the support region of a pixel $p$ and so reduce the influence of erroneously penalized edges. From $p$, four arms extend up, down, left, and right, admitting qualifying points $q$ into the support region $U_p$ of $p$; iteration continues until the following conditions no longer hold (the intensity difference must not be too large and the distance must not be too far):

$$\big|I(q)-I(p)\big| \le \tau, \qquad \|q-p\|_1 \le L$$
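A minimal sketch of the edge-aware smoothness term $C_{ds}$ in PyTorch; averaging over pixels stands in for the 1/N factor:

```python
import torch

def smoothness_loss(disp, image):
    """C_ds: L1 penalty on disparity gradients, gated by the edge-awareness
    term exp(-|dI|) so object boundaries are penalized less.
    disp: (B, 1, H, W) disparity; image: (B, 3, H, W) color input."""
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    # Image gradients, averaged over color channels, gate the penalty.
    wx = torch.exp(-(image[:, :, :, 1:] - image[:, :, :, :-1])
                   .abs().mean(dim=1, keepdim=True))
    wy = torch.exp(-(image[:, :, 1:, :] - image[:, :, :-1, :])
                   .abs().mean(dim=1, keepdim=True))
    return (dx_d * wx).mean() + (dy_d * wy).mean()
```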
In this scheme, in the binocular matching loss part, the matching loss between the binocular image pair is strengthened on the basis of a reconstruction technique, and a disparity-level constraint is added on top of the RGB-level constraint: from the original right-viewpoint RGB image and the predicted disparity map, the corresponding reconstructed left image is obtained by warping, and the reconstructed left RGB image and disparity map are compared with the original left image and predicted disparity map to obtain the binocular matching loss.
In the depth smoothing loss part, the method smooths the depth map while preserving depth changes at object edges and occlusions. A smoothing operation is applied to part of the disparity map while, for object boundaries in regions of high pixel-intensity variation, the edge-awareness term $e$ is introduced to reduce the influence of erroneous penalties. The method also introduces the cross-based adaptive support-region method to limit the support region $U_p$ of a pixel $p$, so that the intensity of every pixel in $U_p$ does not differ too much from that of $p$ and its distance from $p$ is not too far; iteration terminates once these conditions no longer hold.
As an improvement of the present invention, step (5) specifically improves the generated disparity map as follows. To alleviate occlusion at the image edges, for the input picture I not only is the disparity map $D_l$ computed, but a disparity map $D'_l$ is also computed for the mirror-flipped image I' of picture I; flipping that map back yields a disparity map $D''_l$ aligned with $D_l$. The final result combines the left 5% of $D''_l$, the right 5% of $D_l$, and the average of the two in between, reducing the effect of stereo occlusion at the image edges (a sketch follows). Because of stereo occlusion the left view sees more content on its left side and information on the right part of objects is easily lost; selecting the appropriate edge information from the two disparity maps optimizes the output disparity. In addition, a step is added to eliminate smear at object edges: based on object recognition, objects that may exist in the original scene are identified and aligned with the output disparity map, the pixels at object edges in the disparity map are enhanced, and the pixels of the smear region are set to the mean of the neighborhood pixels outside the edge, highlighting objects in the scene and improving the quality of the disparity map.
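A minimal sketch of this flip-and-merge post-processing, assuming a model_predict helper that returns a disparity map for one image; the 5% bands come from the text, and the hard band cut is one reading of the description:

```python
import torch

def postprocess_disparity(model_predict, image):
    """Merge the disparities of an image and of its mirror flip.
    model_predict: callable returning an (H, W) disparity map for a
    (3, H, W) image (assumed helper, not defined in the patent text)."""
    d = model_predict(image)                                # D_l
    d_flip = model_predict(torch.flip(image, dims=[2]))     # D_l' from mirrored input
    d_back = torch.flip(d_flip, dims=[1])                   # D_l'' aligned with D_l

    h, w = d.shape
    band = int(0.05 * w)
    out = (d + d_back) / 2.0          # average of the two in the interior
    out[:, :band] = d_back[:, :band]  # left 5% from the flipped-image disparity
    out[:, -band:] = d[:, -band:]     # right 5% from the original disparity
    return out
```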
Because moving objects may exist in the scene, blurred parts exist in the input image. To highlight objects in the scene, object recognition is used to identify objects that may exist in the original scene and align them with the output disparity map; the pixels at object edges in the disparity map are enhanced, and the pixels of the smear region are set to the mean of the neighborhood pixels outside the edge, eliminating the smear phenomenon to some extent and improving the quality of the disparity map.
The technical scheme is applicable to monocular pictures or videos. Calibrated binocular color images train the depth prediction model; the network architecture extracts the feature space with a DenseNet convolution module whose dense blocks and transition layers connect each layer in the network directly to the preceding layers so that features are reused. The binocular matching loss is improved: the depth prediction problem is treated as an image reconstruction problem, the color image and disparity map of the input left viewpoint are sampled to generate a virtual color image and disparity map, and a stereo matching algorithm for binocular image pairs constrains the consistency of the generated virtual view with the corresponding input right-viewpoint image at both the RGB and disparity levels, yielding better depth. The depth smoothing loss is improved: an L1 penalty is applied to the smoother parts of the image and to regions of discontinuous depth, while for object boundaries and occluded parts in regions of high pixel-intensity variation, the cross-based adaptive support-region method is added on top of the edge-awareness term $e$, limiting the support region of a pixel $p$ and reducing the influence of erroneously penalized edges. In post-processing optimization, the output disparity map is corrected with the original image based on object detection, highlighting object edges in the scene and effectively eliminating smear; to produce a better result on the right side of objects, the input image is flipped to generate a corresponding flipped disparity map, the outer 5% bands of the original and flipped disparity maps are combined, and the average of the two is taken in the middle to form the final result. Because the disparity predicted for moving objects may contain blur and smear, the invention identifies objects that may exist in the original scene based on object recognition, aligns them with the output disparity map, enhances the pixels at object edges in the disparity map, and sets the pixels of the smear region to the mean of the neighborhood pixels outside the edge, eliminating smear to some extent and improving disparity-map quality. The method can generate high-quality dense depth maps, effectively alleviates the artifact problem caused by occlusion in monocular scene depth prediction, and meets the 2D-to-3D conversion requirements of many indoor and outdoor real scenes.
Compared with the prior art, the invention has the following advantages:
1) On the basis of training a scene depth prediction model with binocular image pairs, the method improves the convolution module. Exploiting the dense linking of feature maps in the DenseNet module, each layer receives the feature input of all preceding layers and passes its own features to all subsequent layers, reducing information loss during transmission and improving depth prediction accuracy;
2) In the binocular matching loss part, a reconstruction technique adds a left-right disparity consistency constraint on top of the binocular RGB structural-similarity constraint, making full use of the advantage of binocular images as training data;
3) In the disparity smoothing loss part, a disparity smoothing operation is applied to part of the disparity map to obtain a smooth depth map, while the edge-awareness term and the adaptive support-region method preserve image and occlusion edge information, yielding a sharper depth map;
4) The invention optimizes the edges of the output disparity map: mirror-flipping the input image recovers more information at the right edge of the image, effectively alleviating the occlusion problem;
5) The method effectively eliminates the smear problem in the disparity map: by detecting objects in the original image and aligning them with the disparity map, it enhances the pixel values at object edges and improves the disparity prediction accuracy for individual objects.
Drawings
Figure 1 is an overall flow chart of the present invention,
figure 2 is a schematic view of the binocular matching loss,
fig. 3 is a schematic diagram of adaptively finding a reasonable support region.
Detailed Description
The invention is explained in detail with reference to the drawings, and the specific steps are as follows.
Example 1: as shown in fig. 1, a monocular scene depth prediction method based on deep learning takes a calibrated color image pair as training input, improves the network architecture of the encoder part with a DenseNet convolution module, strengthens loss constraints at the levels of binocular stereo matching and image smoothing, and alleviates the occlusion problem with post-processing, improving the monocular image depth prediction effect overall. It comprises the following steps:
Step 1: Preprocessing. Resize the high-resolution binocular color image pair to 256x512, apply several combined data-enhancement transforms (random flipping and contrast transformation) to the uniformly sized image pair to increase the amount of input data, and then feed the result into the encoder of a convolutional network.
Step 2: The encoder part of the network extracts visual features with a DenseNet-based convolution module; dense connections improve the propagation of information and gradients through the network, alleviating the vanishing-gradient problem and strengthening feature propagation;
Step 3: In the decoder part, set skip connections that splice part of the feature maps from the encoding process directly into the decoding process, use 64 7x7 convolution kernels for up-sampling, and use a sigmoid function as the activation function to generate the disparity (a decoder sketch is given after the step list);
Step 4: Strengthen the binocular matching loss and the depth smoothing loss to optimize the model over iterations and improve prediction accuracy, smoothing the depth map while preserving its edges.
Step 5: Optimize the post-processing part. On one hand, because of stereo occlusion the left view sees more content on its left side and information on the right part of an object is easily lost; the input image is therefore flipped and a corresponding disparity map is generated, which is combined with the disparity map of the original image, selecting the appropriate edge information to optimize the output disparity and alleviate occlusion at the image edges. On the other hand, based on object detection, the output disparity map is corrected with the original image, highlighting object edges in the scene and effectively eliminating smear.
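As referenced in step 3, a minimal sketch of one decoder stage and the disparity head in PyTorch; the 64 7x7 kernels and the sigmoid activation come from the text, while the ELU inside the stage, the 2-channel head, and the sigmoid scaling max_disp_frac are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One decoder stage: upsample, splice in the encoder skip feature map,
    then convolve with 64 kernels of size 7x7 as described in step 3."""
    def __init__(self, in_ch, skip_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, 64, kernel_size=7, padding=3),
            nn.ELU(inplace=True),   # activation inside the stage is assumed
        )

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        x = torch.cat([x, skip], dim=1)   # skip connection from the encoder
        return self.conv(x)

class DisparityHead(nn.Module):
    """Final sigmoid activation producing a bounded disparity map."""
    def __init__(self, in_ch, max_disp_frac=0.3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2, kernel_size=3, padding=1)  # d_l and d_r
        self.max_disp_frac = max_disp_frac  # assumed scale of the sigmoid output

    def forward(self, x):
        return self.max_disp_frac * torch.sigmoid(self.conv(x))
```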
Step 2 is specifically as follows. The training data set of the depth prediction model is a calibrated binocular color image pair, resized to 256x512 as the input of the network. After one convolution with 64 7x7 kernels and one max pooling, a tensor at 1/4 of the input resolution with 64 channels is obtained, which then enters four modules each consisting of a dense block and a transition layer.
The four dense blocks contain 2, 6, 12, and 24 layers respectively, so the network grows steadily deeper. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 and the default bn_size is 4. In the bottleneck layers of the dense blocks, a 1x1 convolution is added before the 3x3 convolution to reduce the number of network parameters. A transition layer is placed between every two dense blocks; it integrates global information with average pooling and improves the compactness of the model. Dense connections improve the propagation of information and gradients through the network, deepening the network while alleviating the vanishing-gradient problem and strengthening feature propagation.
Step 4 is specifically as follows: to optimize the model and obtain a more accurate dense depth map, a binocular matching loss and a depth smoothing loss are added, strengthening the constraint of the loss function on the network:
(4.1) Binocular matching loss: matching cost computation is an important metric of stereo matching algorithms; comparing the similarity of the reconstructed view against the sampled view, using the correlation between the pixels of the binocular image pair, strengthens the stereo matching.
Under spatial similarity, strong correlation exists between the pixels of RGB images. Let the original left image be $I^l_{ij}$ ($i,j$ are the position coordinates of the pixel). From the predicted disparity and the original right image, a reconstructed left image $\tilde{I}^l_{ij}$ is obtained by a warping operation: for each pixel of the left image, the corresponding pixel in the right image is found according to that pixel's disparity value and the result is interpolated. A combination of an L1 term and a single-scale SSIM term, with weighting coefficient $\alpha$, is used as the photometric cost $C_{ap}$ of the image reconstruction, computed between the reconstructed view $\tilde{I}^l$ and the real input view $I^l$:

$$C_{ap} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}(I^l_{ij},\tilde{I}^l_{ij})}{2} + (1-\alpha)\,\big|I^l_{ij}-\tilde{I}^l_{ij}\big|\right]$$

At the disparity level, the invention tries to make the left-viewpoint disparity map equal to the projected right-viewpoint disparity map. Specifically, taking $d^r$ (referenced to the right image) as the input image of the reconstruction operation and $d^l$ (referenced to the left image) as the input disparity map, a reconstruction $\tilde{d}^l$ of $d^l$ is obtained through the warping operation. $C_{lr}$ encourages the predicted left and right disparities to be consistent:

$$C_{lr} = \frac{1}{N}\sum_{i,j}\big|d^l_{ij}-\tilde{d}^l_{ij}\big|$$
(4.2) Depth smoothing loss: where an edge exists in an image there is necessarily a large gradient value; conversely, in a smooth part of the image the gray values change little and the corresponding gradient is small. The gray values at the contour edges of a directly predicted disparity map (depth map) change markedly, producing a strong sense of layering, and depth discontinuities often occur at image gradients; when the occlusion problem arises, the necessary object boundaries must also be preserved.
Since a dense disparity map is required, the disparity should remain locally smooth, so an L1 penalty is applied to the disparity gradients $\partial d$. However, this assumption wrongly penalizes edges and is not applicable to object boundaries, which typically correspond to regions of high variation in pixel intensity. Therefore an edge-awareness term based on the image gradient $\partial I$ is introduced to reduce the influence of erroneous penalties:

$$C_{ds} = \frac{1}{N}\sum_{i,j}\Big(\big|\partial_x d_{ij}\big|\,e^{-\|\partial_x I_{ij}\|} + \big|\partial_y d_{ij}\big|\,e^{-\|\partial_y I_{ij}\|}\Big)$$

On the basis of the edge-awareness term, a cross-based adaptive support-region method is added, which limits the support region of a pixel $p$ to reduce the influence of erroneously penalized edges. From $p$, four arms extend up, down, left, and right, admitting qualifying points $q$ into the support region $U_p$ of $p$; iteration continues until the following conditions no longer hold (the intensity difference must not be too large and the distance must not be too far):

$$\big|I(q)-I(p)\big| \le \tau, \qquad \|q-p\|_1 \le L$$

In this scheme, in the binocular matching loss part, the matching loss between the binocular image pair is strengthened on the basis of a reconstruction technique, and a disparity-level constraint is added on top of the RGB-level constraint: from the original right-viewpoint RGB image and the predicted disparity map, the corresponding reconstructed left image is obtained by warping, and the reconstructed left RGB image and disparity map are compared with the original left image and predicted disparity map to obtain the binocular matching loss.
In the depth smoothing loss part, the method smooths the depth map while preserving depth changes at object edges and occlusions: a smoothing operation is applied to part of the disparity map while, for object boundaries in regions of high pixel-intensity variation, the edge-awareness term is introduced to reduce the influence of erroneous penalties; the cross-based adaptive support-region method limits the support region $U_p$ of a pixel $p$ so that the intensity of every pixel in $U_p$ does not differ too much from that of $p$ and its distance from $p$ is not too far, iterating until these conditions no longer hold.
Step 5 specifically improves the generated disparity map as follows. To alleviate occlusion at the image edges, for the input picture I not only is the disparity map $D_l$ computed, but a disparity map $D'_l$ is also computed for the mirror-flipped image I' of picture I; flipping that map back yields a disparity map $D''_l$ aligned with $D_l$. The final result combines the left 5% of $D''_l$, the right 5% of $D_l$, and the average of the two in between, reducing the effect of stereo occlusion at the image edges. Because of stereo occlusion the left view sees more content on its left side and information on the right part of objects is easily lost; selecting the appropriate edge information from the two disparity maps optimizes the output disparity. In addition, a step is added to eliminate smear at object edges: based on object recognition, objects that may exist in the original scene are identified and aligned with the output disparity map, the pixels at object edges in the disparity map are enhanced, and the pixels of the smear region are set to the mean of the neighborhood pixels outside the edge, highlighting objects in the scene and improving the quality of the disparity map.
Because moving objects may exist in the scene, blurred parts exist in the input image. To highlight objects in the scene, object recognition is used to identify objects that may exist in the original scene and align them with the output disparity map; the pixels at object edges in the disparity map are enhanced, and the pixels of the smear region are set to the mean of the neighborhood pixels outside the edge, eliminating the smear phenomenon to some extent and improving the quality of the disparity map.
The application example is as follows:
as shown in fig. 1, the present invention mainly comprises three parts: improvement of the encoder module of the disparity prediction network, enhancement of the loss-function algorithm, and optimization of the post-processing. Each part is described in detail below:
1. Depth prediction network structure based on the DenseNet module
The network input is the resized 256x512 binocular RGB image pair. In the encoder part of the network, after one convolution (conv) with 64 7x7 kernels and one max pooling (maxpool), the image size becomes 1/4 of the original and the number of channels is 64.
The data then enters four modules each consisting of a dense block and a transition layer. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 with the default bn_size = 4; the bottleneck layer of each dense layer adds a Conv1x1 that outputs 128 channels (bn_size x growth_rate), and the Conv3x3 outputs 32 channels (growth_rate). The transition layer integrates global information with average pooling and sets the number of output channels to half the number of input channels.
2. Algorithm optimization of the binocular matching loss and the depth smoothing loss
(1) Binocular matching loss computation is an important metric of stereo matching algorithms. Based on a reconstruction technique, image reconstruction is used at the RGB and disparity levels, the similarity of the reconstructed view against the sampled view is compared, and the computed costs are aggregated.
A combination of an L1 term and a single-scale SSIM term is used as the photometric cost $C_{ap}$ of image reconstruction, which encourages the reconstructed image to be visually similar to the training input. Let the original left image be $I^l_{ij}$ ($i,j$ are the position coordinates of the pixel); from the predicted disparity and the original right image, a reconstructed left image is obtained by the warping operation. The reconstruction $\tilde{I}^l_{ij}$ produced here finds, for each pixel of the left image, the corresponding pixel in the right image according to that pixel's disparity value and then interpolates. $C_{ap}$ is obtained by comparing the input image $I^l$ and the reconstructed image $\tilde{I}^l$; a simplified SSIM with a 3x3 block filter is used instead of a Gaussian filter. With $N$ the number of pixels:

$$C_{ap} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}(I^l_{ij},\tilde{I}^l_{ij})}{2} + (1-\alpha)\,\big|I^l_{ij}-\tilde{I}^l_{ij}\big|\right]$$
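A minimal sketch of this simplified SSIM in PyTorch, using a 3x3 mean (block) filter instead of a Gaussian; the stability constants C1 and C2 follow the usual SSIM convention and are assumptions here:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Single-scale SSIM with a 3x3 block (mean) filter.
    x, y: (B, C, H, W) images in [0, 1]. Returns a per-pixel SSIM map."""
    mu_x = F.avg_pool2d(x, 3, stride=1, padding=1)
    mu_y = F.avg_pool2d(y, 3, stride=1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, stride=1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, stride=1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, stride=1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(-1, 1)
```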
To generate more accurate disparity maps, the network is trained to predict the disparity of both the left and the right image while using only the left view as the input to the convolutional part of the network; it outputs both the disparity map $d^l$ referenced to the left image and the disparity map $d^r$ referenced to the right image. To ensure consistency, an L1 left-right disparity consistency penalty is introduced as part of the model. This cost tries to make the left-view disparity map equal to the projected right-view disparity map. Specifically, taking $d^r$ as the input image of the reconstruction operation and $d^l$ as the input disparity map, a reconstruction $\tilde{d}^l$ of $d^l$ is obtained through the warping operation. Note that what is obtained here is a reconstructed disparity map, not a reconstructed left image, so the left-right disparity consistency loss can be written as $C_{lr}$, which encourages the predicted left and right disparities to be consistent. With $N$ the number of pixels:

$$C_{lr} = \frac{1}{N}\sum_{i,j}\big|d^l_{ij}-\tilde{d}^l_{ij}\big|$$

where $\tilde{d}^l$ is the reconstructed disparity map. As with all the other cost terms, the disparity map corresponding to the right viewpoint can be predicted at all output scales.
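A minimal sketch of the warping operation in PyTorch, using grid_sample to look up each left-image pixel's correspondent in the right image (or in $d^r$) and interpolate; treating disparity as a fraction of image width and the sign convention are assumptions here:

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(src, disp):
    """Reconstruct the left view by sampling src (the right view, or d_r)
    at positions shifted by the left-referenced disparity map disp.
    src: (B, C, H, W); disp: (B, 1, H, W), disparity as a fraction of width."""
    b, _, h, w = src.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=src.device),
        torch.linspace(-1, 1, w, device=src.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # Shift x-coordinates by the disparity (2 * disp in normalized units).
    grid[..., 0] = grid[..., 0] - 2.0 * disp.squeeze(1)
    # Bilinear interpolation at the shifted positions.
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```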
(2) The gray values at the contour edges of the predicted disparity map change markedly, producing a strong sense of layering, and depth discontinuities often occur at image gradients. Since dense disparity maps are required here, the disparity must not only remain locally smooth but also retain information at object edges and occlusions.
For the disparity to remain locally smooth, an L1 penalty is applied to the disparity gradients $\partial d$. However, this assumption wrongly penalizes edges and is not applicable to object boundaries, which typically correspond to regions of high variation in pixel intensity. Therefore an edge-awareness term based on the image gradient $\partial I$ is introduced to reduce the influence of erroneous penalties. The disparity smoothing loss $C_{ds}$ is defined as:

$$C_{ds} = \frac{1}{N}\sum_{i,j}\Big(\big|\partial_x d_{ij}\big|\,e^{-\|\partial_x I_{ij}\|} + \big|\partial_y d_{ij}\big|\,e^{-\|\partial_y I_{ij}\|}\Big)$$

using the disparity gradients $\partial_x d$, $\partial_y d$ and the image gradients $\partial_x I$, $\partial_y I$ in the x- and y-directions respectively.

On the basis of the edge-awareness term, a cross-based adaptive support-region method is added, limiting the support region of a pixel $p$ to reduce the influence of erroneously penalized edges. From $p$, four arms extend up, down, left, and right, admitting qualifying points $q$ into the support region $U_p$ of $p$; iteration continues until the following conditions no longer hold (the intensity difference must not be too large and the distance must not be too far):

$$\big|I(q)-I(p)\big| \le \tau, \qquad \|q-p\|_1 \le L$$
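A minimal sketch of growing the cross-shaped support region $U_p$ under the two conditions above; the thresholds tau and L are illustrative values, not figures stated in the patent:

```python
import numpy as np

def arm_length(gray, p, step, tau=0.04, L=17):
    """Grow one arm of the cross from pixel p while the intensity difference
    stays below tau and the arm stays shorter than L pixels.
    gray: (H, W) array in [0, 1]; p: (row, col); step: e.g. (0, 1).
    tau and L are illustrative thresholds, not values stated in the patent."""
    h, w = gray.shape
    r, c = p
    n = 0
    while n + 1 < L:
        nr, nc = r + (n + 1) * step[0], c + (n + 1) * step[1]
        if not (0 <= nr < h and 0 <= nc < w):
            break
        if abs(gray[nr, nc] - gray[r, c]) > tau:
            break
        n += 1
    return n

def support_region(gray, p):
    """Union of the four arms: the cross-shaped support region U_p of p."""
    pts = {p}
    for d in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        n = arm_length(gray, p, d)
        pts.update((p[0] + k * d[0], p[1] + k * d[1]) for k in range(1, n + 1))
    return pts
```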
3. The post-processing part alleviates occlusion at the image edges, effectively eliminates the smear of moving objects, and optimizes the output disparity map.
Because of stereo occlusion the left view sees more content on its left side and information on the right part of objects is easily lost, so not only is the disparity map $D_l$ of the input picture I computed, but a disparity map $D'_l$ is also computed for the mirror-flipped image I' of picture I; selecting the appropriate edge information from the two disparity maps optimizes the output disparity and reduces the effect of stereo occlusion at the image edges.
Because moving objects may exist in the scene, blurred parts exist in the input image. To highlight objects in the scene, object recognition is used to identify objects that may exist in the original scene and align them with the output disparity map; the pixels at object edges in the disparity map are enhanced, and the pixels of the smear region are set to the mean of the neighborhood pixels outside the edge, eliminating the smear phenomenon to some extent and improving the quality of the disparity map (a minimal sketch follows).
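A minimal, heavily assumption-laden sketch of this smear cleanup: object boxes are taken from any detector (the patent names only an object recognition technology), and the edge gain, band width, and neighborhood size are illustrative choices:

```python
import numpy as np

def clean_smear(disp, boxes, edge_gain=1.2, nbhd=5):
    """disp: (H, W) disparity map; boxes: iterable of (r0, c0, r1, c1)
    object boxes from a detector aligned with the disparity map.
    edge_gain and nbhd are illustrative assumptions, not patent values."""
    out = disp.copy()
    h, w = disp.shape
    for r0, c0, r1, c1 in boxes:
        # Enhance disparity pixels along the object edge (the box border).
        edge = np.zeros((h, w), dtype=bool)
        edge[r0:r1, [c0, c1 - 1]] = True
        edge[[r0, r1 - 1], c0:c1] = True
        out[edge] = np.clip(disp[edge] * edge_gain, 0, disp.max())
        # Treat a thin band outside the right edge as smear: overwrite each
        # row of the band with the mean of the neighborhood beyond it.
        for r in range(r0, r1):
            b0, b1 = c1, min(c1 + nbhd, w)
            n1 = min(b1 + nbhd, w)
            if n1 > b1:
                out[r, b0:b1] = disp[r, b1:n1].mean()
    return out
```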
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention and are not intended to limit its scope; all equivalent substitutions or modifications of the above technical solutions fall within the scope of the present invention.

Claims (4)

1. A monocular scene depth prediction method based on deep learning is characterized by comprising the following steps:
step 1: a preprocessing operation of resizing the high-resolution binocular color image pair to 256x512 and applying several combined data-enhancement transforms of random flipping and contrast transformation to the uniformly sized image pair to increase the amount of input data, and then inputting the result into the encoder of a convolutional network,
step 2: the encoder part of the network extracts visual features with a DenseNet-based convolution module, improving the propagation of information and gradients through the network via dense connections, alleviating the vanishing-gradient problem and strengthening feature propagation;
step 3: in the decoder part, setting skip connections that splice part of the feature maps from the encoding process directly into the decoding process, using 64 7x7 convolution kernels for up-sampling, and using a sigmoid function as the activation function to generate the disparity;
step 4: strengthening the binocular matching loss and the depth smoothing loss to optimize the model over iterations and improve prediction accuracy, smoothing the depth map while preserving its edges;
step 5: optimizing the post-processing part: the input image is flipped to generate a corresponding disparity map, which is combined with the disparity of the original image to alleviate the edge occlusion problem; based on object detection, the original image is aligned with the disparity map and the pixels at object edges are enhanced, eliminating the smear phenomenon to a certain extent.
2. The monocular scene depth prediction method based on deep learning of claim 1, wherein step 2 is specifically as follows: the training data set of the depth prediction model is a calibrated binocular color image pair, resized to 256x512 as the input of the network; after one convolution with 64 7x7 kernels and one max pooling, a tensor at 1/4 of the input resolution is obtained, which then enters four modules each consisting of a dense block and a transition layer;
the four dense blocks contain 2, 6, 12, and 24 layers respectively, so the network grows steadily deeper; the growth rate (growth_rate) of the dense layers in all dense blocks is set to 32, the default bn_size is 4, and the bottleneck layer in each dense block adds a 1x1 convolution before the 3x3 convolution.
3. The monocular scene depth prediction method based on deep learning according to claim 1, wherein step 4 is specifically as follows:
(3.1) binocular matching loss: matching cost computation is an important metric of stereo matching algorithms, using the correlation between the pixels of the binocular image;
under spatial similarity, strong correlation exists between the pixels of RGB images; let the original left image be $I^l_{ij}$ ($i,j$ are the position coordinates of the pixel); from the predicted disparity and the original right image, a reconstructed left image $\tilde{I}^l_{ij}$ is obtained by the warping operation, finding for each pixel of the left image the corresponding pixel in the right image according to that pixel's disparity value and then interpolating; a combination of an L1 term and a single-scale SSIM term is used as the photometric cost $C_{ap}$ of image reconstruction, computed between the reconstructed view $\tilde{I}^l$ and the real input view $I^l$:

$$C_{ap} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}(I^l_{ij},\tilde{I}^l_{ij})}{2} + (1-\alpha)\,\big|I^l_{ij}-\tilde{I}^l_{ij}\big|\right]$$

at the disparity level, the left-viewpoint disparity map is made equal to the projected right-viewpoint disparity map: taking $d^r$, referenced to the right image, as the input image of the reconstruction operation and $d^l$, referenced to the left image, as the input disparity map, a reconstruction $\tilde{d}^l$ of $d^l$ is obtained through the warping operation; $C_{lr}$ encourages the predicted left and right disparities to be consistent:

$$C_{lr} = \frac{1}{N}\sum_{i,j}\big|d^l_{ij}-\tilde{d}^l_{ij}\big|$$

(3.2) depth smoothing loss: an edge-awareness term based on the image gradient $\partial I$ is introduced to reduce the influence of erroneous penalties:

$$C_{ds} = \frac{1}{N}\sum_{i,j}\Big(\big|\partial_x d_{ij}\big|\,e^{-\|\partial_x I_{ij}\|} + \big|\partial_y d_{ij}\big|\,e^{-\|\partial_y I_{ij}\|}\Big)$$

on the basis of the edge-awareness term, a cross-based adaptive support-region method is added: from a pixel $p$, four arms extend up, down, left, and right, admitting qualifying points $q$ into the support region $U_p$ of $p$, iterating until the following conditions no longer hold:

$$\big|I(q)-I(p)\big| \le \tau, \qquad \|q-p\|_1 \le L$$
4. the method of claim 1, wherein the step (5) is specifically to improve the generated disparity map, and to reduce the occlusion of the image edge, not only the disparity map D of the input picture I is calculated l Also, a disparity map D is calculated for the mirror-inverted image I' of the picture I l ' and then inverting the disparity map to obtain a disparity map D l ″,D l "and D l Aligning; binding D l "left 5% and D l The right 5% of the image and the average between the two constitutes the final result to reduce the effect of stereo occlusion at the edges of the image.
CN202010508803.3A 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning Active CN111899295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010508803.3A CN111899295B (en) 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010508803.3A CN111899295B (en) 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning

Publications (2)

Publication Number Publication Date
CN111899295A (en) 2020-11-06
CN111899295B (en) 2022-11-15

Family

ID=73208030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010508803.3A Active CN111899295B (en) 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning

Country Status (1)

Country Link
CN (1) CN111899295B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561979B (en) * 2020-12-25 2022-06-28 天津大学 Self-supervision monocular depth estimation method based on deep learning
CN113140011B (en) * 2021-05-18 2022-09-06 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related components
CN114119698B (en) * 2021-06-18 2022-07-19 湖南大学 Unsupervised monocular depth estimation method based on attention mechanism
CN114022529B (en) * 2021-10-12 2024-04-16 东北大学 Depth perception method and device based on self-adaptive binocular structured light
CN115184016A (en) * 2022-09-06 2022-10-14 江苏东控自动化科技有限公司 Elevator bearing fault detection method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163246B (en) * 2019-04-08 2021-03-30 杭州电子科技大学 Monocular light field image unsupervised depth estimation method based on convolutional neural network
CN110310317A (en) * 2019-06-28 2019-10-08 西北工业大学 A method of the monocular vision scene depth estimation based on deep learning
CN110490919B (en) * 2019-07-05 2023-04-18 天津大学 Monocular vision depth estimation method based on deep neural network

Also Published As

Publication number Publication date
CN111899295A (en) 2020-11-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant