
CN111899295B - Monocular scene depth prediction method based on deep learning - Google Patents

Monocular scene depth prediction method based on deep learning

Info

Publication number
CN111899295B
CN111899295B (Application CN202010508803.3A)
Authority
CN
China
Prior art keywords
image
parallax
depth
disparity map
edge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010508803.3A
Other languages
Chinese (zh)
Other versions
CN111899295A (en)
Inventor
姚莉
缪静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202010508803.3A
Publication of CN111899295A
Application granted
Publication of CN111899295B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/13 Edge detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

A monocular scene depth prediction method based on deep learning, applicable to monocular pictures or videos. A calibrated binocular color image pair is used to train the depth prediction model. The network architecture extracts the feature space with a DenseNet convolution module, whose dense blocks and transition layers connect each layer in the network directly to the preceding layers so that features are reused. The binocular matching loss is improved: the depth prediction problem is treated as an image reconstruction problem, the color image and disparity map of the input left viewpoint are sampled to generate a virtual color image and disparity map, and a stereo matching algorithm for binocular image pairs constrains the consistency of the generated virtual view with the corresponding input right-viewpoint image at both the RGB level and the disparity level, yielding better depth. The depth smoothing loss is also improved. The method can generate high-quality dense depth maps, effectively alleviates the artifact problem caused by occlusion in monocular scene depth prediction, and can meet the 2D-to-3D conversion requirements of many indoor and outdoor real scenes.

Description

Monocular scene depth prediction method based on deep learning
Technical Field
The invention relates to a monocular scene depth prediction method based on deep learning, and belongs to the field of computer vision and image processing.
Background
Monocular depth prediction is a research topic of great interest in computer vision, with wide application value in fields such as autonomous driving, VR game production, and film production. However, many problems in this field remain to be solved, for example:
1) Collecting depth data with lidar consumes a great deal of energy and is strongly affected by weather;
2) The generated depth image contains artifacts and smear caused by illumination shadows or occlusion in the original image;
3) Methods that recover depth information from sparse depth maps suffer from discontinuous depth at edges;
4) The depth model is not fully differentiable, so gradients become incomputable during optimization and training is suboptimal;
5) The image generation model cannot scale to large output resolutions;
6) The generalization ability of the model is generally limited by the training data.
Disclosure of Invention
To solve these problems, the invention provides a monocular scene depth prediction method based on deep learning. The method is applicable to monocular pictures or videos, obtains a dense depth map of the scene image with an accuracy of up to 91%, and can meet the depth prediction requirements of many indoor and outdoor real scenes.
To achieve this, the technical scheme of the invention is as follows: a monocular scene depth prediction method based on deep learning takes a calibrated color image pair as training input, improves the network architecture of the encoder part with a DenseNet convolution module, strengthens loss constraints at several levels (binocular stereo matching and image smoothing), and alleviates the occlusion problem with post-processing, improving the monocular image depth prediction effect overall. The method comprises the following steps:
Step 1: Preprocessing. Resize the high-resolution binocular color image pair to 256x512, apply several combined data-enhancement transforms (random flipping and contrast transformation) to the uniformly sized image pair to increase the amount of input data, and then feed the result into the encoder of a convolutional network (a preprocessing sketch is given after the step list).
Step 2: The encoder part of the network extracts visual features with a DenseNet-based convolution module; dense connections improve the propagation of information and gradients through the network, alleviating the vanishing-gradient problem and strengthening feature propagation;
Step 3: In the decoder part, set skip connections that splice part of the feature maps from the encoding process directly into the decoding process, use 64 7x7 convolution kernels for up-sampling, and use a sigmoid function as the activation function to generate the disparity;
Step 4: Strengthen the binocular matching loss and the depth smoothing loss to optimize the model over iterations and improve prediction accuracy, smoothing the depth map while preserving its edges;
Step 5: Optimize the post-processing part. On one hand, because of stereo occlusion the left view sees more content on its left side and information on the right part of an object is easily lost; the input image is therefore flipped and a corresponding disparity map is generated, which is combined with the disparity map of the original image, selecting the appropriate edge information to optimize the output disparity and alleviate occlusion at the image edges. On the other hand, based on object detection, the output disparity map is corrected with the original image, highlighting object edges in the scene and effectively eliminating smear.
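A minimal sketch of the preprocessing of step 1 in PyTorch, assuming torchvision for the contrast transform; the 256x512 target size, random flipping, and contrast transformation come from the text, while p_flip, contrast_range, and the swap of views on flipping are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import adjust_contrast

def preprocess_pair(left, right, p_flip=0.5, contrast_range=(0.8, 1.2)):
    """Resize a calibrated binocular pair to 256x512 and apply paired
    augmentations. left, right: (3, H, W) float tensors in [0, 1].
    p_flip and contrast_range are illustrative values, not from the patent."""
    size = (256, 512)
    left = F.interpolate(left.unsqueeze(0), size=size, mode="bilinear",
                         align_corners=False).squeeze(0)
    right = F.interpolate(right.unsqueeze(0), size=size, mode="bilinear",
                          align_corners=False).squeeze(0)

    # Random horizontal flip. Mirroring swaps the roles of the two views,
    # so they are exchanged as well (an assumed detail for stereo pairs).
    if torch.rand(1).item() < p_flip:
        left, right = torch.flip(right, dims=[2]), torch.flip(left, dims=[2])

    # Random contrast change applied identically to both views.
    c = float(torch.empty(1).uniform_(*contrast_range))
    left, right = adjust_contrast(left, c), adjust_contrast(right, c)
    return left, right
```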
As an improvement of the present invention, step 2 is specifically as follows. The training data set of the depth prediction model is a calibrated binocular color image pair, resized to 256x512 as the input of the network. After one convolution with 64 7x7 kernels and one max pooling, a tensor at 1/4 of the input resolution with 64 channels is obtained, which then enters four modules each consisting of a dense block and a transition layer (a sketch of this encoder follows).
The four dense blocks contain 2, 6, 12, and 24 layers respectively, so the network grows steadily deeper. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 and the default bn_size is 4. In the bottleneck layers of the dense blocks, a 1x1 convolution is added before the 3x3 convolution to reduce the number of network parameters. A transition layer is placed between every two dense blocks; it integrates global information with average pooling and improves the compactness of the model. Dense connections improve the propagation of information and gradients through the network, alleviating the vanishing-gradient problem while deepening the network and strengthening feature propagation.
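A minimal sketch of this encoder in PyTorch, under the stated configuration (stem of 64 7x7 kernels plus max pooling, dense blocks of 2, 6, 12, and 24 layers, growth_rate 32, bn_size 4, transitions halving the channels with average pooling); the module wiring beyond these figures is an illustrative assumption, not the patent's exact network:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Bottleneck dense layer: 1x1 conv to bn_size*growth_rate channels,
    then 3x3 conv producing growth_rate new feature channels."""
    def __init__(self, in_ch, growth_rate=32, bn_size=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, bn_size * growth_rate, 1, bias=False),
            nn.BatchNorm2d(bn_size * growth_rate), nn.ReLU(inplace=True),
            nn.Conv2d(bn_size * growth_rate, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # Dense connectivity: concatenate the new features onto the input.
        return torch.cat([x, self.body(x)], dim=1)

def dense_block(in_ch, n_layers, growth_rate=32):
    layers, ch = [], in_ch
    for _ in range(n_layers):
        layers.append(DenseLayer(ch, growth_rate))
        ch += growth_rate
    return nn.Sequential(*layers), ch

class Encoder(nn.Module):
    """DenseNet-style encoder: 7x7/64 stem, then 4 dense blocks of 2/6/12/24 layers."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),   # 1/4 resolution, 64 channels
        )
        blocks, ch = [], 64
        for n in (2, 6, 12, 24):
            blk, ch = dense_block(ch, n)
            blocks.append(blk)
            # Transition layer: halve channels, average-pool to integrate
            # global information and keep the model compact.
            blocks.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch // 2, 1, bias=False),
                nn.AvgPool2d(2, stride=2),
            ))
            ch //= 2
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        return self.blocks(self.stem(x))
```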
As an improvement of the present invention, step 4 is specifically as follows. To optimize the model and obtain a more accurate dense depth map, a binocular matching loss and a depth smoothing loss are added, strengthening the constraint that the loss function imposes on the network:
(4.1) Binocular matching loss: matching cost computation is an important metric of stereo matching algorithms; comparing the similarity of the reconstructed view against the sampled view, using the correlation between the pixels of the binocular image pair, strengthens the stereo matching.
Under spatial similarity, strong correlation exists between the pixels of RGB images. Let the original left image be $I^l_{ij}$ (where $i,j$ are the position coordinates of the pixel). From the predicted disparity and the original right image, a reconstructed left image $\tilde{I}^l_{ij}$ is obtained by a warping operation: for each pixel of the left image, the corresponding pixel in the right image is found according to that pixel's disparity value and the result is interpolated. A combination of an L1 term and a single-scale SSIM term, with weighting coefficient $\alpha$, is used as the photometric cost $C_{ap}$ of the image reconstruction, computed between the reconstructed view $\tilde{I}^l$ and the real input view $I^l$:

$$C_{ap} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}(I^l_{ij},\tilde{I}^l_{ij})}{2} + (1-\alpha)\,\big|I^l_{ij}-\tilde{I}^l_{ij}\big|\right]$$

At the disparity level, the invention tries to make the left-viewpoint disparity map equal to the projected right-viewpoint disparity map. Specifically, taking $d^r$ (the disparity map referenced to the right image) as the input image of the reconstruction operation and $d^l$ (referenced to the left image) as the input disparity map, a reconstruction $\tilde{d}^l$ of $d^l$ is obtained through the warping operation. $C_{lr}$ encourages the predicted left and right disparities to be consistent:

$$C_{lr} = \frac{1}{N}\sum_{i,j}\big|d^l_{ij}-\tilde{d}^l_{ij}\big|$$
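A minimal sketch of these two matching-loss terms in PyTorch; the weight alpha is an assumed value in line with common L1+SSIM combinations, and the ssim and warping helpers referenced here are sketched in the application example below:

```python
import torch

def photometric_loss(left, left_rec, alpha=0.85):
    """C_ap: weighted single-scale SSIM plus L1 between the real left view
    and its reconstruction from the right view. alpha is an assumed weight,
    not a value stated in the patent; ssim is the 3x3 block-filter helper
    sketched in the application example."""
    l1 = (left - left_rec).abs().mean()
    dssim = ((1.0 - ssim(left, left_rec)) / 2.0).mean()
    return alpha * dssim + (1.0 - alpha) * l1

def lr_consistency_loss(disp_left, disp_left_rec):
    """C_lr: L1 between the predicted left disparity d_l and its
    reconstruction obtained by warping d_r with d_l (see the warping
    sketch further below)."""
    return (disp_left - disp_left_rec).abs().mean()
```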
(4.2) Depth smoothing loss: where an edge exists in an image there is necessarily a large gradient value; conversely, in a smooth part of the image the gray values change little and the corresponding gradient is small. The gray values at the contour edges of a directly predicted disparity map (depth map) change markedly, producing a strong sense of layering, and depth discontinuities often occur at image gradients; when the occlusion problem arises, the necessary object boundaries must also be preserved.
Since a dense disparity map is required, the disparity should remain locally smooth, so an L1 penalty is applied to the disparity gradients $\partial d$. However, this assumption wrongly penalizes edges and is not applicable to object boundaries, which typically correspond to regions of high variation in pixel intensity. Therefore an edge-awareness term $e^{-\|\partial I\|}$ based on the image gradient $\partial I$ is introduced to reduce the influence of erroneous penalties:

$$C_{ds} = \frac{1}{N}\sum_{i,j}\Big(\big|\partial_x d_{ij}\big|\,e^{-\|\partial_x I_{ij}\|} + \big|\partial_y d_{ij}\big|\,e^{-\|\partial_y I_{ij}\|}\Big)$$

On the basis of the edge-awareness term, a cross-based adaptive support-region method is added. The idea is to limit the support region of a pixel $p$ and so reduce the influence of erroneously penalized edges. From $p$, four arms extend up, down, left, and right, admitting qualifying points $q$ into the support region $U_p$ of $p$; iteration continues until the following conditions no longer hold (the intensity difference must not be too large and the distance must not be too far):

$$\big|I(q)-I(p)\big| \le \tau, \qquad \|q-p\|_1 \le L$$
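A minimal sketch of the edge-aware smoothness term $C_{ds}$ in PyTorch; averaging over pixels stands in for the 1/N factor:

```python
import torch

def smoothness_loss(disp, image):
    """C_ds: L1 penalty on disparity gradients, gated by the edge-awareness
    term exp(-|dI|) so object boundaries are penalized less.
    disp: (B, 1, H, W) disparity; image: (B, 3, H, W) color input."""
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    # Image gradients, averaged over color channels, gate the penalty.
    wx = torch.exp(-(image[:, :, :, 1:] - image[:, :, :, :-1])
                   .abs().mean(dim=1, keepdim=True))
    wy = torch.exp(-(image[:, :, 1:, :] - image[:, :, :-1, :])
                   .abs().mean(dim=1, keepdim=True))
    return (dx_d * wx).mean() + (dy_d * wy).mean()
```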
In this scheme, in the binocular matching loss part, the matching loss between the binocular image pair is strengthened on the basis of a reconstruction technique, and a disparity-level constraint is added on top of the RGB-level constraint: from the original right-viewpoint RGB image and the predicted disparity map, the corresponding reconstructed left image is obtained by warping, and the reconstructed left RGB image and disparity map are compared with the original left image and predicted disparity map to obtain the binocular matching loss.
In the depth smoothing loss part, the method smooths the depth map while preserving depth changes at object edges and occlusions. A smoothing operation is applied to part of the disparity map while, for object boundaries in regions of high pixel-intensity variation, the edge-awareness term $e$ is introduced to reduce the influence of erroneous penalties. The method also introduces the cross-based adaptive support-region method to limit the support region $U_p$ of a pixel $p$, so that the intensity of every pixel in $U_p$ does not differ too much from that of $p$ and its distance from $p$ is not too far; iteration terminates once these conditions no longer hold.
As an improvement of the present invention, step (5) specifically improves the generated disparity map as follows. To alleviate occlusion at the image edges, for the input picture I not only is the disparity map $D_l$ computed, but a disparity map $D'_l$ is also computed for the mirror-flipped image I' of picture I; flipping that map back yields a disparity map $D''_l$ aligned with $D_l$. The final result combines the left 5% of $D''_l$, the right 5% of $D_l$, and the average of the two in between, reducing the effect of stereo occlusion at the image edges (a sketch follows). Because of stereo occlusion the left view sees more content on its left side and information on the right part of objects is easily lost; selecting the appropriate edge information from the two disparity maps optimizes the output disparity. In addition, a step is added to eliminate smear at object edges: based on object recognition, objects that may exist in the original scene are identified and aligned with the output disparity map, the pixels at object edges in the disparity map are enhanced, and the pixels of the smear region are set to the mean of the neighborhood pixels outside the edge, highlighting objects in the scene and improving the quality of the disparity map.
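A minimal sketch of this flip-and-merge post-processing, assuming a model_predict helper that returns a disparity map for one image; the 5% bands come from the text, and the hard band cut is one reading of the description:

```python
import torch

def postprocess_disparity(model_predict, image):
    """Merge the disparities of an image and of its mirror flip.
    model_predict: callable returning an (H, W) disparity map for a
    (3, H, W) image (assumed helper, not defined in the patent text)."""
    d = model_predict(image)                                # D_l
    d_flip = model_predict(torch.flip(image, dims=[2]))     # D_l' from mirrored input
    d_back = torch.flip(d_flip, dims=[1])                   # D_l'' aligned with D_l

    h, w = d.shape
    band = int(0.05 * w)
    out = (d + d_back) / 2.0          # average of the two in the interior
    out[:, :band] = d_back[:, :band]  # left 5% from the flipped-image disparity
    out[:, -band:] = d[:, -band:]     # right 5% from the original disparity
    return out
```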
Because moving objects may exist in the scene, blurred parts exist in the input image. To highlight objects in the scene, object recognition is used to identify objects that may exist in the original scene and align them with the output disparity map; the pixels at object edges in the disparity map are enhanced, and the pixels of the smear region are set to the mean of the neighborhood pixels outside the edge, eliminating the smear phenomenon to some extent and improving the quality of the disparity map.
The technical scheme is applicable to monocular pictures or videos. Calibrated binocular color images train the depth prediction model; the network architecture extracts the feature space with a DenseNet convolution module whose dense blocks and transition layers connect each layer in the network directly to the preceding layers so that features are reused. The binocular matching loss is improved: the depth prediction problem is treated as an image reconstruction problem, the color image and disparity map of the input left viewpoint are sampled to generate a virtual color image and disparity map, and a stereo matching algorithm for binocular image pairs constrains the consistency of the generated virtual view with the corresponding input right-viewpoint image at both the RGB and disparity levels, yielding better depth. The depth smoothing loss is improved: an L1 penalty is applied to the smoother parts of the image and to regions of discontinuous depth, while for object boundaries and occluded parts in regions of high pixel-intensity variation, the cross-based adaptive support-region method is added on top of the edge-awareness term $e$, limiting the support region of a pixel $p$ and reducing the influence of erroneously penalized edges. In post-processing optimization, the output disparity map is corrected with the original image based on object detection, highlighting object edges in the scene and effectively eliminating smear; to produce a better result on the right side of objects, the input image is flipped to generate a corresponding flipped disparity map, the outer 5% bands of the original and flipped disparity maps are combined, and the average of the two is taken in the middle to form the final result. Because the disparity predicted for moving objects may contain blur and smear, the invention identifies objects that may exist in the original scene based on object recognition, aligns them with the output disparity map, enhances the pixels at object edges in the disparity map, and sets the pixels of the smear region to the mean of the neighborhood pixels outside the edge, eliminating smear to some extent and improving disparity-map quality. The method can generate high-quality dense depth maps, effectively alleviates the artifact problem caused by occlusion in monocular scene depth prediction, and meets the 2D-to-3D conversion requirements of many indoor and outdoor real scenes.
Compared with the prior art, the invention has the following advantages:
1) On the basis of training a scene depth prediction model with binocular image pairs, the method improves the convolution module. Exploiting the dense linking of feature maps in the DenseNet module, each layer receives the feature input of all preceding layers and passes its own features to all subsequent layers, reducing information loss during transmission and improving depth prediction accuracy;
2) In the binocular matching loss part, a reconstruction technique adds a left-right disparity consistency constraint on top of the binocular RGB structural-similarity constraint, making full use of the advantage of binocular images as training data;
3) In the disparity smoothing loss part, a disparity smoothing operation is applied to part of the disparity map to obtain a smooth depth map, while the edge-awareness term and the adaptive support-region method preserve image and occlusion edge information, yielding a sharper depth map;
4) The invention optimizes the edges of the output disparity map: mirror-flipping the input image recovers more information at the right edge of the image, effectively alleviating the occlusion problem;
5) The method effectively eliminates the smear problem in the disparity map: by detecting objects in the original image and aligning them with the disparity map, it enhances the pixel values at object edges and improves the disparity prediction accuracy for individual objects.
Drawings
Figure 1 is an overall flow chart of the present invention,
figure 2 is a schematic view of the binocular matching loss,
fig. 3 is a schematic diagram of adaptively finding a reasonable support region.
Detailed Description
The invention is explained in detail with reference to the drawings, and the specific steps are as follows.
Example 1: as shown in fig. 1, a monocular scene depth prediction method based on deep learning takes a calibrated color image pair as training input, improves the network architecture of the encoder part with a DenseNet convolution module, strengthens loss constraints at the levels of binocular stereo matching and image smoothing, and alleviates the occlusion problem with post-processing, improving the monocular image depth prediction effect overall. It comprises the following steps:
Step 1: Preprocessing. Resize the high-resolution binocular color image pair to 256x512, apply several combined data-enhancement transforms (random flipping and contrast transformation) to the uniformly sized image pair to increase the amount of input data, and then feed the result into the encoder of a convolutional network.
Step 2: The encoder part of the network extracts visual features with a DenseNet-based convolution module; dense connections improve the propagation of information and gradients through the network, alleviating the vanishing-gradient problem and strengthening feature propagation;
Step 3: In the decoder part, set skip connections that splice part of the feature maps from the encoding process directly into the decoding process, use 64 7x7 convolution kernels for up-sampling, and use a sigmoid function as the activation function to generate the disparity (a decoder sketch is given after the step list);
Step 4: Strengthen the binocular matching loss and the depth smoothing loss to optimize the model over iterations and improve prediction accuracy, smoothing the depth map while preserving its edges.
Step 5: Optimize the post-processing part. On one hand, because of stereo occlusion the left view sees more content on its left side and information on the right part of an object is easily lost; the input image is therefore flipped and a corresponding disparity map is generated, which is combined with the disparity map of the original image, selecting the appropriate edge information to optimize the output disparity and alleviate occlusion at the image edges. On the other hand, based on object detection, the output disparity map is corrected with the original image, highlighting object edges in the scene and effectively eliminating smear.
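As referenced in step 3, a minimal sketch of one decoder stage and the disparity head in PyTorch; the 64 7x7 kernels and the sigmoid activation come from the text, while the ELU inside the stage, the 2-channel head, and the sigmoid scaling max_disp_frac are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One decoder stage: upsample, splice in the encoder skip feature map,
    then convolve with 64 kernels of size 7x7 as described in step 3."""
    def __init__(self, in_ch, skip_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, 64, kernel_size=7, padding=3),
            nn.ELU(inplace=True),   # activation inside the stage is assumed
        )

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        x = torch.cat([x, skip], dim=1)   # skip connection from the encoder
        return self.conv(x)

class DisparityHead(nn.Module):
    """Final sigmoid activation producing a bounded disparity map."""
    def __init__(self, in_ch, max_disp_frac=0.3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2, kernel_size=3, padding=1)  # d_l and d_r
        self.max_disp_frac = max_disp_frac  # assumed scale of the sigmoid output

    def forward(self, x):
        return self.max_disp_frac * torch.sigmoid(self.conv(x))
```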
Step 2 is specifically as follows. The training data set of the depth prediction model is a calibrated binocular color image pair, resized to 256x512 as the input of the network. After one convolution with 64 7x7 kernels and one max pooling, a tensor at 1/4 of the input resolution with 64 channels is obtained, which then enters four modules each consisting of a dense block and a transition layer.
The four dense blocks contain 2, 6, 12, and 24 layers respectively, so the network grows steadily deeper. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 and the default bn_size is 4. In the bottleneck layers of the dense blocks, a 1x1 convolution is added before the 3x3 convolution to reduce the number of network parameters. A transition layer is placed between every two dense blocks; it integrates global information with average pooling and improves the compactness of the model. Dense connections improve the propagation of information and gradients through the network, deepening the network while alleviating the vanishing-gradient problem and strengthening feature propagation.
Step 4 is specifically as follows: to optimize the model and obtain a more accurate dense depth map, a binocular matching loss and a depth smoothing loss are added, strengthening the constraint of the loss function on the network:
(4.1) Binocular matching loss: matching cost computation is an important metric of stereo matching algorithms; comparing the similarity of the reconstructed view against the sampled view, using the correlation between the pixels of the binocular image pair, strengthens the stereo matching.
Under spatial similarity, strong correlation exists between the pixels of RGB images. Let the original left image be $I^l_{ij}$ ($i,j$ are the position coordinates of the pixel). From the predicted disparity and the original right image, a reconstructed left image $\tilde{I}^l_{ij}$ is obtained by a warping operation: for each pixel of the left image, the corresponding pixel in the right image is found according to that pixel's disparity value and the result is interpolated. A combination of an L1 term and a single-scale SSIM term, with weighting coefficient $\alpha$, is used as the photometric cost $C_{ap}$ of the image reconstruction, computed between the reconstructed view $\tilde{I}^l$ and the real input view $I^l$:

$$C_{ap} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}(I^l_{ij},\tilde{I}^l_{ij})}{2} + (1-\alpha)\,\big|I^l_{ij}-\tilde{I}^l_{ij}\big|\right]$$

At the disparity level, the invention tries to make the left-viewpoint disparity map equal to the projected right-viewpoint disparity map. Specifically, taking $d^r$ (referenced to the right image) as the input image of the reconstruction operation and $d^l$ (referenced to the left image) as the input disparity map, a reconstruction $\tilde{d}^l$ of $d^l$ is obtained through the warping operation. $C_{lr}$ encourages the predicted left and right disparities to be consistent:

$$C_{lr} = \frac{1}{N}\sum_{i,j}\big|d^l_{ij}-\tilde{d}^l_{ij}\big|$$
(4.2) Depth smoothing loss: where an edge exists in an image there is necessarily a large gradient value; conversely, in a smooth part of the image the gray values change little and the corresponding gradient is small. The gray values at the contour edges of a directly predicted disparity map (depth map) change markedly, producing a strong sense of layering, and depth discontinuities often occur at image gradients; when the occlusion problem arises, the necessary object boundaries must also be preserved.
Since a dense disparity map is required, the disparity should remain locally smooth, so an L1 penalty is applied to the disparity gradients $\partial d$. However, this assumption wrongly penalizes edges and is not applicable to object boundaries, which typically correspond to regions of high variation in pixel intensity. Therefore an edge-awareness term based on the image gradient $\partial I$ is introduced to reduce the influence of erroneous penalties:

$$C_{ds} = \frac{1}{N}\sum_{i,j}\Big(\big|\partial_x d_{ij}\big|\,e^{-\|\partial_x I_{ij}\|} + \big|\partial_y d_{ij}\big|\,e^{-\|\partial_y I_{ij}\|}\Big)$$

On the basis of the edge-awareness term, a cross-based adaptive support-region method is added, which limits the support region of a pixel $p$ to reduce the influence of erroneously penalized edges. From $p$, four arms extend up, down, left, and right, admitting qualifying points $q$ into the support region $U_p$ of $p$; iteration continues until the following conditions no longer hold (the intensity difference must not be too large and the distance must not be too far):

$$\big|I(q)-I(p)\big| \le \tau, \qquad \|q-p\|_1 \le L$$

In this scheme, in the binocular matching loss part, the matching loss between the binocular image pair is strengthened on the basis of a reconstruction technique, and a disparity-level constraint is added on top of the RGB-level constraint: from the original right-viewpoint RGB image and the predicted disparity map, the corresponding reconstructed left image is obtained by warping, and the reconstructed left RGB image and disparity map are compared with the original left image and predicted disparity map to obtain the binocular matching loss.
In the depth smoothing loss part, the method smooths the depth map while preserving depth changes at object edges and occlusions: a smoothing operation is applied to part of the disparity map while, for object boundaries in regions of high pixel-intensity variation, the edge-awareness term is introduced to reduce the influence of erroneous penalties; the cross-based adaptive support-region method limits the support region $U_p$ of a pixel $p$ so that the intensity of every pixel in $U_p$ does not differ too much from that of $p$ and its distance from $p$ is not too far, iterating until these conditions no longer hold.
Step 5 specifically improves the generated disparity map as follows. To alleviate occlusion at the image edges, for the input picture I not only is the disparity map $D_l$ computed, but a disparity map $D'_l$ is also computed for the mirror-flipped image I' of picture I; flipping that map back yields a disparity map $D''_l$ aligned with $D_l$. The final result combines the left 5% of $D''_l$, the right 5% of $D_l$, and the average of the two in between, reducing the effect of stereo occlusion at the image edges. Because of stereo occlusion the left view sees more content on its left side and information on the right part of objects is easily lost; selecting the appropriate edge information from the two disparity maps optimizes the output disparity. In addition, a step is added to eliminate smear at object edges: based on object recognition, objects that may exist in the original scene are identified and aligned with the output disparity map, the pixels at object edges in the disparity map are enhanced, and the pixels of the smear region are set to the mean of the neighborhood pixels outside the edge, highlighting objects in the scene and improving the quality of the disparity map.
Because moving objects may exist in the scene, blurred parts exist in the input image. To highlight objects in the scene, object recognition is used to identify objects that may exist in the original scene and align them with the output disparity map; the pixels at object edges in the disparity map are enhanced, and the pixels of the smear region are set to the mean of the neighborhood pixels outside the edge, eliminating the smear phenomenon to some extent and improving the quality of the disparity map.
The application example is as follows:
as shown in fig. 1, the present invention mainly comprises three parts: improvement of the encoder module of the disparity prediction network, enhancement of the loss-function algorithm, and optimization of the post-processing. Each part is described in detail below:
1. Depth prediction network structure based on the DenseNet module
The network input is the resized 256x512 binocular RGB image pair. In the encoder part of the network, after one convolution (conv) with 64 7x7 kernels and one max pooling (maxpool), the image size becomes 1/4 of the original and the number of channels is 64.
The data then enters four modules each consisting of a dense block and a transition layer. The growth rate (growth_rate) of the dense layers in all dense blocks is set to 32 with the default bn_size = 4; the bottleneck layer of each dense layer adds a Conv1x1 that outputs 128 channels (bn_size x growth_rate), and the Conv3x3 outputs 32 channels (growth_rate). The transition layer integrates global information with average pooling and sets the number of output channels to half the number of input channels.
2. Algorithm optimization of the binocular matching loss and the depth smoothing loss
(1) Binocular matching loss computation is an important metric of stereo matching algorithms. Based on a reconstruction technique, image reconstruction is used at the RGB and disparity levels, the similarity of the reconstructed view against the sampled view is compared, and the computed costs are aggregated.
A combination of an L1 term and a single-scale SSIM term is used as the photometric cost $C_{ap}$ of image reconstruction, which encourages the reconstructed image to be visually similar to the training input. Let the original left image be $I^l_{ij}$ ($i,j$ are the position coordinates of the pixel); from the predicted disparity and the original right image, a reconstructed left image is obtained by the warping operation. The reconstruction $\tilde{I}^l_{ij}$ produced here finds, for each pixel of the left image, the corresponding pixel in the right image according to that pixel's disparity value and then interpolates. $C_{ap}$ is obtained by comparing the input image $I^l$ and the reconstructed image $\tilde{I}^l$; a simplified SSIM with a 3x3 block filter is used instead of a Gaussian filter. With $N$ the number of pixels:

$$C_{ap} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}(I^l_{ij},\tilde{I}^l_{ij})}{2} + (1-\alpha)\,\big|I^l_{ij}-\tilde{I}^l_{ij}\big|\right]$$
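A minimal sketch of this simplified SSIM in PyTorch, using a 3x3 mean (block) filter instead of a Gaussian; the stability constants C1 and C2 follow the usual SSIM convention and are assumptions here:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Single-scale SSIM with a 3x3 block (mean) filter.
    x, y: (B, C, H, W) images in [0, 1]. Returns a per-pixel SSIM map."""
    mu_x = F.avg_pool2d(x, 3, stride=1, padding=1)
    mu_y = F.avg_pool2d(y, 3, stride=1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, stride=1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, stride=1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, stride=1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(-1, 1)
```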
To generate more accurate disparity maps, the network is trained to predict the disparity of both the left and the right image while using only the left view as the input to the convolutional part of the network; it outputs both the disparity map $d^l$ referenced to the left image and the disparity map $d^r$ referenced to the right image. To ensure consistency, an L1 left-right disparity consistency penalty is introduced as part of the model. This cost tries to make the left-view disparity map equal to the projected right-view disparity map. Specifically, taking $d^r$ as the input image of the reconstruction operation and $d^l$ as the input disparity map, a reconstruction $\tilde{d}^l$ of $d^l$ is obtained through the warping operation. Note that what is obtained here is a reconstructed disparity map, not a reconstructed left image, so the left-right disparity consistency loss can be written as $C_{lr}$, which encourages the predicted left and right disparities to be consistent. With $N$ the number of pixels:

$$C_{lr} = \frac{1}{N}\sum_{i,j}\big|d^l_{ij}-\tilde{d}^l_{ij}\big|$$

where $\tilde{d}^l$ is the reconstructed disparity map. As with all the other cost terms, the disparity map corresponding to the right viewpoint can be predicted at all output scales.
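A minimal sketch of the warping operation in PyTorch, using grid_sample to look up each left-image pixel's correspondent in the right image (or in $d^r$) and interpolate; treating disparity as a fraction of image width and the sign convention are assumptions here:

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(src, disp):
    """Reconstruct the left view by sampling src (the right view, or d_r)
    at positions shifted by the left-referenced disparity map disp.
    src: (B, C, H, W); disp: (B, 1, H, W), disparity as a fraction of width."""
    b, _, h, w = src.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=src.device),
        torch.linspace(-1, 1, w, device=src.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # Shift x-coordinates by the disparity (2 * disp in normalized units).
    grid[..., 0] = grid[..., 0] - 2.0 * disp.squeeze(1)
    # Bilinear interpolation at the shifted positions.
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```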
(2) The gray values at the contour edges of the predicted disparity map change markedly, producing a strong sense of layering, and depth discontinuities often occur at image gradients. Since dense disparity maps are required here, the disparity must not only remain locally smooth but also retain information at object edges and occlusions.
For the disparity to remain locally smooth, an L1 penalty is applied to the disparity gradients $\partial d$. However, this assumption wrongly penalizes edges and is not applicable to object boundaries, which typically correspond to regions of high variation in pixel intensity. Therefore an edge-awareness term based on the image gradient $\partial I$ is introduced to reduce the influence of erroneous penalties. The disparity smoothing loss $C_{ds}$ is defined as:

$$C_{ds} = \frac{1}{N}\sum_{i,j}\Big(\big|\partial_x d_{ij}\big|\,e^{-\|\partial_x I_{ij}\|} + \big|\partial_y d_{ij}\big|\,e^{-\|\partial_y I_{ij}\|}\Big)$$

using the disparity gradients $\partial_x d$, $\partial_y d$ and the image gradients $\partial_x I$, $\partial_y I$ in the x- and y-directions respectively.

On the basis of the edge-awareness term, a cross-based adaptive support-region method is added, limiting the support region of a pixel $p$ to reduce the influence of erroneously penalized edges. From $p$, four arms extend up, down, left, and right, admitting qualifying points $q$ into the support region $U_p$ of $p$; iteration continues until the following conditions no longer hold (the intensity difference must not be too large and the distance must not be too far):

$$\big|I(q)-I(p)\big| \le \tau, \qquad \|q-p\|_1 \le L$$
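A minimal sketch of growing the cross-shaped support region $U_p$ under the two conditions above; the thresholds tau and L are illustrative values, not figures stated in the patent:

```python
import numpy as np

def arm_length(gray, p, step, tau=0.04, L=17):
    """Grow one arm of the cross from pixel p while the intensity difference
    stays below tau and the arm stays shorter than L pixels.
    gray: (H, W) array in [0, 1]; p: (row, col); step: e.g. (0, 1).
    tau and L are illustrative thresholds, not values stated in the patent."""
    h, w = gray.shape
    r, c = p
    n = 0
    while n + 1 < L:
        nr, nc = r + (n + 1) * step[0], c + (n + 1) * step[1]
        if not (0 <= nr < h and 0 <= nc < w):
            break
        if abs(gray[nr, nc] - gray[r, c]) > tau:
            break
        n += 1
    return n

def support_region(gray, p):
    """Union of the four arms: the cross-shaped support region U_p of p."""
    pts = {p}
    for d in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        n = arm_length(gray, p, d)
        pts.update((p[0] + k * d[0], p[1] + k * d[1]) for k in range(1, n + 1))
    return pts
```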
3. The post-processing part alleviates occlusion at the image edges, effectively eliminates the smear of moving objects, and optimizes the output disparity map.
Because of stereo occlusion the left view sees more content on its left side and information on the right part of objects is easily lost, so not only is the disparity map $D_l$ of the input picture I computed, but a disparity map $D'_l$ is also computed for the mirror-flipped image I' of picture I; selecting the appropriate edge information from the two disparity maps optimizes the output disparity and reduces the effect of stereo occlusion at the image edges.
Because moving objects may exist in the scene, blurred parts exist in the input image. To highlight objects in the scene, object recognition is used to identify objects that may exist in the original scene and align them with the output disparity map; the pixels at object edges in the disparity map are enhanced, and the pixels of the smear region are set to the mean of the neighborhood pixels outside the edge, eliminating the smear phenomenon to some extent and improving the quality of the disparity map (a minimal sketch follows).
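A minimal, heavily assumption-laden sketch of this smear cleanup: object boxes are taken from any detector (the patent names only an object recognition technology), and the edge gain, band width, and neighborhood size are illustrative choices:

```python
import numpy as np

def clean_smear(disp, boxes, edge_gain=1.2, nbhd=5):
    """disp: (H, W) disparity map; boxes: iterable of (r0, c0, r1, c1)
    object boxes from a detector aligned with the disparity map.
    edge_gain and nbhd are illustrative assumptions, not patent values."""
    out = disp.copy()
    h, w = disp.shape
    for r0, c0, r1, c1 in boxes:
        # Enhance disparity pixels along the object edge (the box border).
        edge = np.zeros((h, w), dtype=bool)
        edge[r0:r1, [c0, c1 - 1]] = True
        edge[[r0, r1 - 1], c0:c1] = True
        out[edge] = np.clip(disp[edge] * edge_gain, 0, disp.max())
        # Treat a thin band outside the right edge as smear: overwrite each
        # row of the band with the mean of the neighborhood beyond it.
        for r in range(r0, r1):
            b0, b1 = c1, min(c1 + nbhd, w)
            n1 = min(b1 + nbhd, w)
            if n1 > b1:
                out[r, b0:b1] = disp[r, b1:n1].mean()
    return out
```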
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention and are not intended to limit its scope; all equivalent substitutions or modifications of the above technical solutions fall within the scope of the present invention.

Claims (4)

1. A monocular scene depth prediction method based on deep learning is characterized by comprising the following steps:
step 1: a preprocessing operation of resizing the high-resolution binocular color image pair to 256x512 and applying several combined data-enhancement transforms of random flipping and contrast transformation to the uniformly sized image pair to increase the amount of input data, and then inputting the result into the encoder of a convolutional network,
step 2: the encoder part of the network extracts visual features with a DenseNet-based convolution module, improving the propagation of information and gradients through the network via dense connections, alleviating the vanishing-gradient problem and strengthening feature propagation;
step 3: in the decoder part, setting skip connections that splice part of the feature maps from the encoding process directly into the decoding process, using 64 7x7 convolution kernels for up-sampling, and using a sigmoid function as the activation function to generate the disparity;
step 4: strengthening the binocular matching loss and the depth smoothing loss to optimize the model over iterations and improve prediction accuracy, smoothing the depth map while preserving its edges;
step 5: optimizing the post-processing part: the input image is flipped to generate a corresponding disparity map, which is combined with the disparity of the original image to alleviate the edge occlusion problem; based on object detection, the original image is aligned with the disparity map and the pixels at object edges are enhanced, eliminating the smear phenomenon to a certain extent.
2. The monocular scene depth prediction method based on deep learning of claim 1, wherein step 2 is specifically as follows: the training data set of the depth prediction model is a calibrated binocular color image pair, resized to 256x512 as the input of the network; after one convolution with 64 7x7 kernels and one max pooling, a tensor at 1/4 of the input resolution is obtained, which then enters four modules each consisting of a dense block and a transition layer;
the four dense blocks contain 2, 6, 12, and 24 layers respectively, so the network grows steadily deeper; the growth rate (growth_rate) of the dense layers in all dense blocks is set to 32, the default bn_size is 4, and the bottleneck layer in each dense block adds a 1x1 convolution before the 3x3 convolution.
3. The monocular scene depth prediction method based on deep learning according to claim 1, wherein step 4 is specifically as follows:
(3.1) binocular matching loss: matching cost computation is an important metric of stereo matching algorithms, using the correlation between the pixels of the binocular image;
under spatial similarity, strong correlation exists between the pixels of RGB images; let the original left image be $I^l_{ij}$ ($i,j$ are the position coordinates of the pixel); from the predicted disparity and the original right image, a reconstructed left image $\tilde{I}^l_{ij}$ is obtained by the warping operation, finding for each pixel of the left image the corresponding pixel in the right image according to that pixel's disparity value and then interpolating; a combination of an L1 term and a single-scale SSIM term is used as the photometric cost $C_{ap}$ of image reconstruction, computed between the reconstructed view $\tilde{I}^l$ and the real input view $I^l$:

$$C_{ap} = \frac{1}{N}\sum_{i,j}\left[\alpha\,\frac{1-\mathrm{SSIM}(I^l_{ij},\tilde{I}^l_{ij})}{2} + (1-\alpha)\,\big|I^l_{ij}-\tilde{I}^l_{ij}\big|\right]$$

at the disparity level, the left-viewpoint disparity map is made equal to the projected right-viewpoint disparity map: taking $d^r$, referenced to the right image, as the input image of the reconstruction operation and $d^l$, referenced to the left image, as the input disparity map, a reconstruction $\tilde{d}^l$ of $d^l$ is obtained through the warping operation; $C_{lr}$ encourages the predicted left and right disparities to be consistent:

$$C_{lr} = \frac{1}{N}\sum_{i,j}\big|d^l_{ij}-\tilde{d}^l_{ij}\big|$$

(3.2) depth smoothing loss: an edge-awareness term based on the image gradient $\partial I$ is introduced to reduce the influence of erroneous penalties:

$$C_{ds} = \frac{1}{N}\sum_{i,j}\Big(\big|\partial_x d_{ij}\big|\,e^{-\|\partial_x I_{ij}\|} + \big|\partial_y d_{ij}\big|\,e^{-\|\partial_y I_{ij}\|}\Big)$$

on the basis of the edge-awareness term, a cross-based adaptive support-region method is added: from a pixel $p$, four arms extend up, down, left, and right, admitting qualifying points $q$ into the support region $U_p$ of $p$, iterating until the following conditions no longer hold:

$$\big|I(q)-I(p)\big| \le \tau, \qquad \|q-p\|_1 \le L$$
4. the method of claim 1, wherein the step (5) is specifically to improve the generated disparity map, and to reduce the occlusion of the image edge, not only the disparity map D of the input picture I is calculated l Also, a disparity map D is calculated for the mirror-inverted image I' of the picture I l ' and then inverting the disparity map to obtain a disparity map D l ″,D l "and D l Aligning; binding D l "left 5% and D l The right 5% of the image and the average between the two constitutes the final result to reduce the effect of stereo occlusion at the edges of the image.
CN202010508803.3A 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning Active CN111899295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010508803.3A CN111899295B (en) 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010508803.3A CN111899295B (en) 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning

Publications (2)

Publication Number Publication Date
CN111899295A (en) 2020-11-06
CN111899295B (en) 2022-11-15

Family

ID=73208030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010508803.3A Active CN111899295B (en) 2020-06-06 2020-06-06 Monocular scene depth prediction method based on deep learning

Country Status (1)

Country Link
CN (1) CN111899295B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561979B (en) * 2020-12-25 2022-06-28 天津大学 Self-supervision monocular depth estimation method based on deep learning
CN113140011B (en) * 2021-05-18 2022-09-06 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related components
CN114119698B (en) * 2021-06-18 2022-07-19 湖南大学 Unsupervised monocular depth estimation method based on attention mechanism
CN114022529B (en) * 2021-10-12 2024-04-16 东北大学 Depth perception method and device based on self-adaptive binocular structured light
CN115184016A (en) * 2022-09-06 2022-10-14 江苏东控自动化科技有限公司 Elevator bearing fault detection method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163246B (en) * 2019-04-08 2021-03-30 杭州电子科技大学 Monocular light field image unsupervised depth estimation method based on convolutional neural network
CN110310317A (en) * 2019-06-28 2019-10-08 西北工业大学 A method of the monocular vision scene depth estimation based on deep learning
CN110490919B (en) * 2019-07-05 2023-04-18 天津大学 Monocular vision depth estimation method based on deep neural network

Also Published As

Publication number Publication date
CN111899295A (en) 2020-11-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant