
CN115861401A - Binocular and point cloud fusion depth recovery method, device and medium - Google Patents


Info

Publication number
CN115861401A
CN115861401A
Authority
CN
China
Prior art keywords
point cloud
depth
image
binocular
sparse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310170221.2A
Other languages
Chinese (zh)
Other versions
CN115861401B (en)
Inventor
许振宇
李月华
朱世强
邢琰
姜甜甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Control Engineering
Zhejiang Lab
Original Assignee
Beijing Institute of Control Engineering
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Control Engineering, Zhejiang Lab filed Critical Beijing Institute of Control Engineering
Priority to CN202310170221.2A priority Critical patent/CN115861401B/en
Publication of CN115861401A publication Critical patent/CN115861401A/en
Application granted granted Critical
Publication of CN115861401B publication Critical patent/CN115861401B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a binocular and point cloud fusion depth recovery method, device and medium. The method constructs a depth recovery neural network comprising a sparse expansion module, a multi-scale feature extraction and fusion module, a variable-weight Gaussian modulation module and a cascaded three-dimensional convolutional neural network module. Taking sparse point cloud data and a binocular image as input, a semi-dense depth image is obtained through neighborhood expansion; features are extracted from this image and the binocular image and fused; a cost volume is constructed and modulated with a variable-weight Gaussian modulation function; and cost aggregation is performed through a deep learning network to recover dense depth information. On the basis of a binocular stereo matching network, a sparse point cloud is introduced, the density of the guide points is increased by a neighborhood expansion method, and Gaussian modulation is combined with multi-scale feature extraction and fusion, which helps to improve the accuracy and robustness of depth recovery and makes the method effective for dense depth recovery in real applications.

Description

A binocular and point cloud fusion depth recovery method, device and medium

Technical Field

The present invention relates to the field of computer vision, and in particular to a binocular and point cloud fusion depth recovery method, device and medium.

Background Art

Depth recovery is a very important task in computer vision and is widely used in many fields such as robotics, autonomous driving and 3D reconstruction.

Compared with traditional binocular stereo matching methods, depth recovery algorithms that fuse binocular images with a sparse point cloud introduce a high-precision sparse point cloud from sensors such as LiDAR or TOF cameras as prior information, which guides the recovery of depth. Especially in scenes with weak texture, occlusion or large domain changes, the depth information provided by the sparse point cloud can effectively improve the accuracy and robustness of depth recovery.

Existing binocular and point cloud fusion depth recovery algorithms fall into two categories, point-cloud-guided cost aggregation and point cloud information fusion, and both categories use the raw sparse point cloud directly for guidance or fusion. However, because the input point cloud is sparse, point-cloud-guided cost aggregation provides limited actual guidance and only modulates along the depth dimension, so it cannot supply sufficient prior information in the image dimensions. For point cloud information fusion, both direct fusion and feature fusion suffer from the discontinuity of the data, so the texture of the extracted fused information is weak.

Summary of the Invention

The purpose of the present invention is to provide a binocular and point cloud fusion depth recovery method, device and medium that address the deficiencies of the prior art.

The objective of the present invention is achieved through the following technical solutions. In a first aspect, an embodiment of the present invention provides a binocular and point cloud fusion depth recovery method, comprising the following steps:

(1) Constructing a depth recovery network, the depth recovery network comprising a sparse expansion module, a multi-scale feature extraction and fusion module, a variable-weight Gaussian modulation module and a cascaded three-dimensional convolutional neural network module; the input of the depth recovery network is a binocular image and sparse point cloud data, and the output is a dense depth image;

(2) Training the depth recovery network constructed in step (1): using a binocular data set, input the binocular image and sparse point cloud data, project the sparse point cloud data into the left camera coordinate system to generate a sparse depth map, perform data augmentation on the binocular image and the sparse depth map, compute the loss of the output dense depth image against the ground-truth depth image, and iteratively update the network weights by back propagation;

(3) Inputting the binocular image to be tested and the sparse point cloud data into the depth recovery network trained in step (2), projecting the sparse point cloud data into the left camera coordinate system using the sensor calibration parameters to generate a sparse depth image, and outputting a dense depth image.

Further, the sparse expansion module is specifically: guided by the multi-channel information of the image, the density of the sparse point cloud data is increased by a neighborhood expansion method, and a semi-dense depth map is output.

Further, constructing the sparse expansion module includes the following sub-steps:

(a1) Obtaining a sparse depth map according to the pose relationship between the point cloud data and the left camera image, and extracting the pixel coordinates of the valid points in the sparse depth map, the corresponding multi-channel image values and the multi-channel image values of the points in their neighborhoods;

(a2) Computing the average image value deviation from the multi-channel image values at the pixel coordinates of each valid point and the multi-channel image values of the points in its neighborhood;

(a3) Expanding the sparse depth map into a semi-dense depth map according to the average image value deviation of the valid points and a set fixed threshold, and outputting the semi-dense depth map.

Further, the multi-scale feature extraction and fusion module is specifically: taking the semi-dense depth map output by the sparse expansion module and the binocular image as input, a Unet encoder-decoder structure combined with spatial pyramid pooling is used to extract point cloud features, left-eye image features and right-eye image features, and the left-eye image features and point cloud features are then fused by concatenation at the feature level to obtain the fused features.

Further, constructing the multi-scale feature extraction and fusion module includes the following sub-steps:

(b1) Performing multi-layer downsampling encoding on the semi-dense depth map output by the sparse expansion module and on the binocular image, to obtain downsampled left-eye image features, right-eye image features and point cloud features at multiple scales;

(b2) Performing spatial pyramid pooling on the lowest-resolution downsampled left-eye image features, right-eye image features and point cloud features, to obtain the pooled results;

(b3) Performing multi-layer upsampling decoding on the pooled left-eye image features, right-eye image features and point cloud features, to obtain upsampled left-eye image features, right-eye image features and point cloud features at multiple scales;

(b4) Concatenating the upsampled left-eye image features and point cloud features along the feature dimension to obtain the fused features of the left-eye image and the point cloud.

Further, the variable-weight Gaussian modulation module is specifically: based on the data reliability of the semi-dense depth map, Gaussian modulation functions with different weights are generated to modulate the depth dimension of the cost volume at different pixel positions.

Further, constructing the variable-weight Gaussian modulation module includes the following sub-steps:

(c1) Constructing a cost volume by concatenation from the fused features and the right-eye image features;

(c2) Constructing Gaussian modulation functions with different weights according to the reliability of the sparse point cloud;

(c3) Modulating the cost volume with the constructed Gaussian modulation functions to obtain the modulated cost volume.

Further, constructing the cascaded three-dimensional convolutional neural network module includes the following sub-steps:

(d1) Fusing and aggregating the low-resolution cost volume with a three-dimensional convolutional neural network to obtain the aggregated cost volume;

(d2) Applying the softmax function over all depth values at each pixel coordinate to obtain a low-resolution depth map;

(d3) Upsampling the low-resolution depth map to obtain a prediction of the high-resolution depth map, and obtaining the dense depth map at full resolution through three cascaded iterations.

A second aspect of an embodiment of the present invention provides a binocular and point cloud fusion depth recovery device, comprising one or more processors, configured to implement the above binocular and point cloud fusion depth recovery method.

A third aspect of an embodiment of the present invention provides a computer-readable storage medium having a program stored thereon which, when executed by a processor, implements the above binocular and point cloud fusion depth recovery method.

The beneficial effects of the present invention are as follows. The invention recovers dense depth by fusing a point cloud with binocular images: taking sparse point cloud data and a binocular image as input, a semi-dense depth image is obtained through neighborhood expansion; features are extracted from this depth image and the binocular image and fused; a cost volume is constructed and modulated with a variable-weight Gaussian modulation function; and cost aggregation is performed through a deep learning network to recover dense depth information. On the basis of a binocular stereo matching depth recovery network, a sparse point cloud is introduced, the density of the guide points is increased by the neighborhood expansion method, and on this basis Gaussian modulation guidance together with multi-scale feature extraction and fusion is adopted to improve the accuracy and robustness of depth recovery. The invention relies on sensor equipment that can provide binocular image data and sparse point cloud data, helps to improve accuracy and robustness, and is an effective method for dense depth recovery in real applications.

Brief Description of the Drawings

Figure 1 is a diagram of the overall network architecture;

Figure 2 is a schematic diagram of sparse expansion;

Figure 3 is a schematic diagram of variable-weight Gaussian modulation;

Figure 4 shows the results of the present invention, where a is the input left-eye image, b is the input right-eye image, c is the input sparse point cloud reprojected into the left-eye coordinate system, and d is the recovered depth image;

Figure 5 is a schematic structural diagram of the binocular and point cloud fusion depth recovery device of the present invention.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings.

As shown in Figure 1, the binocular and point cloud fusion depth recovery method of the present invention comprises the following steps:

(1) Construct the depth recovery network.

The overall network architecture is implemented with the open-source deep learning framework PyTorch and is built on the public binocular stereo matching network CF-NET, adding four parts: a sparse expansion module, a multi-scale feature extraction and fusion module, a variable-weight Gaussian modulation module and a cascaded three-dimensional convolutional neural network module. The input of the depth recovery network is the binocular image and sparse point cloud data, and the output is a dense depth image.

(1.1) Construct the sparse expansion module.

The overall processing flow of this module is shown in Figure 2. Guided by the multi-channel information of the image, the neighborhood expansion method increases the density of the sparse point cloud data and outputs a semi-dense depth map.

(a1) According to the pose relationship between the point cloud data and the left camera image, the OpenCV reprojection function is used to project the input sparse point cloud data into the camera coordinate system, giving a sparse depth map D of size W×H. Points with depth value D(u,v) greater than 0 are defined as valid points. For each valid point of the sparse depth map D, its pixel coordinates (u,v), the corresponding multi-channel image values I_c(u,v) and the multi-channel image values I_c(u+α, v+β) of the points in its neighborhood are extracted, where D(u,v) denotes the depth value, at coordinate (u,v), of the sparse depth map D reprojected onto the left image; W and H denote the width and height of the image, with W=960 and H=512 in this embodiment; I_c(u,v) denotes the value of channel c of the image at pixel coordinate (u,v); C denotes the number of channels, with C=3 for an RGB image; α and β denote the offsets of a neighborhood point along the horizontal and vertical coordinates, with α, β ∈ [-r, r]; and r denotes the neighborhood distance, taken as r=2 in this embodiment. It should be understood that C may also take other values, for example C=4 for an RGBA image.

(a2) For each valid point, the average image value deviation E_{α,β}(u,v) between the multi-channel image values I_c(u,v) at its pixel coordinates (u,v) and the multi-channel image values I_c(u+α, v+β) of a point in its neighborhood is computed as

E_{α,β}(u,v) = (1/C) · Σ_{c=1..C} | I_c(u,v) − I_c(u+α, v+β) |

where C is the number of channels, I_c(u,v) denotes the value of channel c of the image at pixel coordinate (u,v), I_c(u+α, v+β) denotes the value of channel c at a point in the neighborhood of (u,v), α and β denote the offsets of the neighborhood point along the horizontal and vertical coordinates, α ∈ [-r, r], β ∈ [-r, r], and r denotes the neighborhood distance.

(a3) For the pixel coordinates (u,v) of each valid point, the average image value deviation E_{α,β}(u,v) is compared with a fixed threshold (Threshold). The threshold expresses how easily pixels are expanded and can be adjusted according to the accuracy of the final depth recovery; in this embodiment it is set to 8. The sparse depth map D is then expanded into a semi-dense depth map Dexp by assigning

Dexp(u+α, v+β) = D(u,v) if E_{α,β}(u,v) < Threshold, and Dexp(u+α, v+β) = D(u+α, v+β) otherwise,

where D(u,v) denotes the depth value, at coordinate (u,v), of the sparse depth map D reprojected onto the left image, D(u+α, v+β) denotes its depth value at coordinate (u+α, v+β), α and β denote the offsets of the neighborhood point along the horizontal and vertical coordinates, α ∈ [-r, r], β ∈ [-r, r], and r denotes the neighborhood distance. After the neighborhood expansion of all valid points is completed, the final semi-dense depth map is obtained and output.
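A minimal NumPy sketch of the neighborhood expansion of sub-steps (a1)-(a3) is given below, assuming the sparse depth map D and the multi-channel left image I are already aligned; the function name and the handling of overlapping expansions are illustrative, not taken from the original disclosure.

```python
import numpy as np

def expand_sparse_depth(D, I, r=2, threshold=8.0):
    """Expand a sparse depth map D (H, W) into a semi-dense map, guided by the
    multi-channel left image I (H, W, C): the depth of a valid point is copied
    to a neighbouring pixel only if the average per-channel colour deviation
    between the two pixels is below the threshold."""
    H, W, C = I.shape
    D_exp = D.copy()
    vs, us = np.nonzero(D > 0)                        # valid points (depth > 0)
    for v, u in zip(vs, us):
        for b in range(-r, r + 1):                    # vertical offset beta
            for a in range(-r, r + 1):                # horizontal offset alpha
                un, vn = u + a, v + b
                if not (0 <= un < W and 0 <= vn < H) or D[vn, un] > 0:
                    continue                          # skip out-of-image and original points
                dev = np.mean(np.abs(I[v, u].astype(np.float32) -
                                     I[vn, un].astype(np.float32)))
                if dev < threshold:
                    D_exp[vn, un] = D[v, u]           # expand the valid depth to the neighbour
    return D_exp
```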

(1.2) Construct the multi-scale feature extraction and fusion module.

This module takes the semi-dense depth map output by the sparse expansion module and the binocular image as input. A Unet encoder-decoder structure combined with spatial pyramid pooling is used to extract point cloud features, left-eye image features and right-eye image features, and the left-eye image features and point cloud features are then fused by concatenation at the feature level to obtain the fused features.

(b1) Multi-layer downsampling encoding is applied separately to the semi-dense depth map and to the binocular image, giving downsampled left-eye image features, right-eye image features and point cloud features at multiple scales, where F_i denotes the feature dimension of the i-th downsampling encoding layer. In this embodiment the semi-dense depth map and the binocular image are downsampled through five levels of residual blocks, with W=960 and H=512.

(b2) Spatial pyramid pooling is applied separately to the lowest-resolution downsampled left-eye image features, right-eye image features and point cloud features, giving the left-eye pooling result, the right-eye pooling result and the point cloud pooling result, where SPP(·) denotes the pooling function applied to the downsampled encoded features and N denotes the maximum number of downsampling encoding layers (the pooling is applied to the features of the N-th, lowest-resolution, layer).

In this embodiment, a spatial pyramid pooling method similar to that of the public network HSMNet is adopted: the lowest-resolution downsampled left-eye image features, right-eye image features and point cloud features are passed through four levels of average pooling, each level with its own pooling size, and the pooled results of the three branches are obtained accordingly.
(b3) The pooled left-eye image features, right-eye image features and point cloud features are then decoded by multi-layer upsampling, giving upsampled decoded left-eye image features, right-eye image features and point cloud features at multiple scales, where F_i denotes the feature dimension of the i-th upsampling decoding layer. Each decoded feature is obtained by applying the upsampling decoding module to a concatenation of the corresponding features, where concat(·) denotes the vector concatenation function, Dec(·) denotes the processing function of the upsampling decoding module, and N denotes the maximum number of upsampling decoding layers.

In this embodiment, upsampling decoding is performed through five corresponding upsampling decoding modules, with F1=64, F2=128, F3=192, F4=256 and F5=512.

(b4) The upsampled decoded left-eye image features and point cloud features are concatenated along the feature dimension, giving the fused feature of the left-eye image and the point cloud, i.e. concat(l_i, p_i), where concat(·) denotes the vector concatenation function, l_i denotes the upsampled decoded left-eye image feature, p_i denotes the upsampled decoded point cloud feature, and i denotes the feature level.
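A minimal PyTorch sketch of the feature-level fusion of sub-step (b4), together with a simple HSMNet-style spatial pyramid pooling layer in the spirit of sub-step (b2), is given below; the module structure, pooling sizes and channel counts are illustrative assumptions rather than the exact architecture of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSPP(nn.Module):
    """Average-pool a low-resolution feature map at several scales, upsample
    the pooled maps back, and project the concatenation to the input width."""
    def __init__(self, channels, scales=(8, 16, 32, 64)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Conv2d(channels * (len(scales) + 1), channels, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = [F.interpolate(F.avg_pool2d(x, s), size=(h, w),
                                mode='bilinear', align_corners=False)
                  for s in self.scales]
        return self.proj(torch.cat([x] + pooled, dim=1))

def fuse_left_and_point_cloud(left_feat, pc_feat):
    """Sub-step (b4): concatenate decoded left-image and point-cloud features
    along the channel (feature) dimension."""
    return torch.cat([left_feat, pc_feat], dim=1)

# usage on dummy tensors (batch 1, 64 channels, 1/4 resolution of a 512x960 input)
left = torch.randn(1, 64, 128, 240)
pc = torch.randn(1, 64, 128, 240)
fused = fuse_left_and_point_cloud(SimpleSPP(64)(left), SimpleSPP(64)(pc))
```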

(1.3) Construct the variable-weight Gaussian modulation module.

Based on the data reliability of the semi-dense depth image, Gaussian modulation functions with different weights are generated to modulate the depth dimension of the cost volume at different pixel positions.

(c1) According to the fused features and the right-eye image features, a cost volume is constructed in a concatenation (cascade) manner over the disparity search range, where D_max denotes the maximum disparity search range and takes the value 256 in this embodiment, F_c denotes the feature dimension of the cost volume, F_i denotes the feature dimension of the i-th upsampling decoding layer, and W=960, H=512.
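A minimal PyTorch sketch of a concatenation-style cost volume, as used in sub-step (c1), is given below: for each candidate disparity the right-image features are shifted and concatenated with the fused left-image features. The tensor layout, the reduced disparity range and the zero padding of out-of-range pixels are illustrative assumptions.

```python
import torch

def build_concat_cost_volume(fused_left, right_feat, max_disp=48):
    """fused_left, right_feat: (B, F, H, W) feature maps at reduced resolution.
       Returns a concatenation cost volume of shape (B, 2F, max_disp, H, W)."""
    B, Fc, H, W = fused_left.shape
    cost = fused_left.new_zeros(B, 2 * Fc, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            cost[:, :Fc, d] = fused_left
            cost[:, Fc:, d] = right_feat
        else:
            cost[:, :Fc, d, :, d:] = fused_left[:, :, :, d:]
            cost[:, Fc:, d, :, d:] = right_feat[:, :, :, :-d]   # right features shifted by d
    return cost
```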

(c2) According to the reliability of the sparse point cloud, Gaussian modulation functions with different weights are constructed: a first Gaussian, with weight k1 and variance c1, is centred along the depth dimension on the value D(u,v) given by the original sparse depth map, and a second Gaussian, with weight k2 and variance c2, is centred on the value Dexp(u,v) given by the expanded semi-dense depth map. Here k1 and c1 denote the weight and variance of the modulation function corresponding to the original sparse point cloud, and k2 and c2 denote the weight and variance of the modulation function corresponding to the expanded point cloud; in this embodiment k1=10, c1=1, k2=2, c2=8. D(u,v) denotes the depth value, at coordinate (u,v), of the sparse depth map D reprojected onto the left image, Dexp(u,v) denotes the depth value, at coordinate (u,v), of the semi-dense depth map Dexp reprojected onto the left image, the masks of D and Dexp (written here as m1(u,v) and m2(u,v)) are set to 1 when the corresponding point is valid (depth value greater than 0) and to 0 otherwise, and d denotes the coordinate along the depth dimension.

(c3) The cost volume is modulated with the constructed Gaussian modulation functions, giving the modulated cost volume. Specifically, for every feature value of the cost volume at pixel position (u,v) and depth index d, the modulated feature value is obtained by weighting the original feature value with the Gaussian modulation functions defined in (c2).

The overall flow of the variable-weight Gaussian modulation module is shown in Figure 3. The corresponding sparse point cloud is divided into invalid points, original points and points obtained by neighborhood expansion.

Specifically, for an invalid point both masks are 0 (m1(u,v)=0 and m2(u,v)=0), so neither Gaussian is applied and the cost volume at the corresponding position remains unchanged. For an original point of the sparse point cloud m1(u,v)=1, so the cost volume at the corresponding position is modulated with the high-weight, low-variance Gaussian modulation function (k1=10, c1=1). For a point obtained by neighborhood expansion m1(u,v)=0 and m2(u,v)=1; since the reliability of such points is lower, the cost volume at the corresponding position is modulated with the low-weight, high-variance Gaussian modulation function (k2=2, c2=8).
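A minimal PyTorch sketch of the variable-weight Gaussian modulation is given below, under the interpretation that the (k1, c1) Gaussian is applied where the original sparse depth map is valid and the (k2, c2) Gaussian where only the expanded map is valid, following the case analysis above; the exact expression of the modulation function appears only as a formula image in the original publication, so this form is an assumption.

```python
import torch

def modulate_cost_volume(cost, D, D_exp, k1=10.0, c1=1.0, k2=2.0, c2=8.0):
    """cost:  (B, F, Dmax, H, W) cost volume
       D:     (B, H, W) sparse depth/disparity hints (0 = invalid)
       D_exp: (B, H, W) expanded semi-dense hints (0 = invalid)"""
    B, Fc, Dmax, H, W = cost.shape
    d = torch.arange(Dmax, device=cost.device).float().view(1, 1, Dmax, 1, 1)
    m1 = (D > 0).float().unsqueeze(1).unsqueeze(2)                           # original points
    m2 = ((D_exp > 0).float() * (D <= 0).float()).unsqueeze(1).unsqueeze(2)  # expanded-only points
    g1 = 1 + m1 * k1 * torch.exp(-(d - D.unsqueeze(1).unsqueeze(2)) ** 2 / (2 * c1 ** 2))
    g2 = 1 + m2 * k2 * torch.exp(-(d - D_exp.unsqueeze(1).unsqueeze(2)) ** 2 / (2 * c2 ** 2))
    return cost * g1 * g2        # invalid points keep the cost volume unchanged (g1 = g2 = 1)
```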

(1.4) Construct the cascaded three-dimensional convolutional neural network module.

(d1) Following the cascaded three-dimensional convolutional neural network of the public network CF-NET, the low-resolution cost volume is fused and aggregated through an hourglass-shaped three-dimensional convolutional neural network, giving the aggregated cost volume.

(d2) The softmax function is applied over all depth values at each pixel coordinate, giving a low-resolution depth map.

(d3) The low-resolution depth map is upsampled to obtain a prediction of the high-resolution depth map. Centred on this prediction, the range of depths actually predicted is chosen according to the reliability of the prediction and serves as the depth distribution range for the aggregation of the high-resolution cost volume. This range is passed recursively into the cost aggregation of the high-resolution cost volume, which is aggregated through an hourglass-shaped three-dimensional convolutional neural network to give the aggregated cost volume at the next higher resolution; its depth dimension equals the number of depth levels at the current stage, and each depth index corresponds to an actual depth value within the chosen range. As before, the softmax function is applied over all depth values at each pixel coordinate, giving the depth map at the current resolution.

Three cascaded iterations of the above process finally give the dense depth map at full resolution. The architecture of the cascaded three-dimensional convolutional neural network is shown in Figure 1.
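A minimal PyTorch sketch of the softmax-based depth regression of sub-step (d2) is given below: the aggregated cost volume is converted into per-pixel probabilities over the depth indices and the expectation is taken as the predicted depth. Reducing the aggregated volume to a single score per depth index is an assumption about how the volume is consumed.

```python
import torch
import torch.nn.functional as F

def soft_argmax_depth(agg_cost, depth_values):
    """agg_cost:     (B, Dmax, H, W) aggregated matching scores, one per depth index
       depth_values: (Dmax,) actual depth/disparity value of each index
       returns:      (B, H, W) regressed depth map"""
    prob = F.softmax(agg_cost, dim=1)                       # probability over depth indices
    depth = (prob * depth_values.view(1, -1, 1, 1)).sum(1)  # expectation over depth values
    return depth

# usage: 256 disparity levels at quarter resolution of a 512x960 input
cost = torch.randn(1, 256, 128, 240)
disp_values = torch.arange(256).float()
depth_map = soft_argmax_depth(cost, disp_values)
```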

(2) Train the depth recovery network constructed in step (1): using a binocular data set, input the binocular image and sparse point cloud data, project the sparse point cloud data into the left camera coordinate system to generate a sparse depth map, perform data augmentation on the binocular image and the sparse depth map, compute the loss of the output dense depth image against the ground-truth depth image, and iteratively update the network weights by back propagation.

In this embodiment the open-source SceneFlow binocular data set is used as the task sample. The data set contains 35454 binocular image pairs with ground-truth depth for training and 7349 pairs with ground-truth depth for testing. During training, 5% of the points of the ground-truth depth are randomly sampled to obtain a sparse depth map that simulates the sparse depth map of a reprojected point cloud and serves as the sparse depth input.
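A minimal sketch of this sampling step is given below, assuming the ground-truth depth is available as a NumPy array; the 5% ratio follows the text, the remaining names are illustrative.

```python
import numpy as np

def sample_sparse_depth(gt_depth, ratio=0.05, rng=None):
    """Randomly keep `ratio` of the valid ground-truth depth pixels to simulate
    the sparse depth map produced by reprojecting a point cloud."""
    rng = np.random.default_rng() if rng is None else rng
    sparse = np.zeros_like(gt_depth)
    valid = gt_depth > 0
    keep = valid & (rng.random(gt_depth.shape) < ratio)
    sparse[keep] = gt_depth[keep]
    return sparse
```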

The binocular images are augmented, in order, with random occlusion, asymmetric color transformation and random cropping. Random occlusion is implemented by randomly generating a rectangular region and replacing the image data at all coordinates of the corresponding region of the right image with the mean image value. Asymmetric color transformation applies different brightness, contrast and gamma transformations to the left and right images; the corresponding processing can be implemented directly with adjust_brightness, adjust_contrast and adjust_gamma under torchvision.transforms.functional, with the function parameters drawn from a random generator. Random cropping randomly generates a rectangular region of fixed size and discards the image information outside it. The sparse depth map is likewise augmented, in order, with random occlusion and random cropping; the position of its random occlusion is generated independently and need not coincide with that of the binocular images, whereas the randomly cropped region must coincide with the crop of the binocular images so that the binocular image information and the depth information remain aligned.
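A minimal sketch of the asymmetric color transformation is given below, using the torchvision functional API mentioned above; the sampling ranges of the brightness, contrast and gamma factors are illustrative choices, not values from the text.

```python
import random
import torchvision.transforms.functional as TF

def asymmetric_color_transform(left, right):
    """Apply independently sampled brightness / contrast / gamma adjustments
    to the left and right images (PIL images or tensors)."""
    out = []
    for img in (left, right):
        img = TF.adjust_brightness(img, random.uniform(0.8, 1.2))  # illustrative range
        img = TF.adjust_contrast(img, random.uniform(0.8, 1.2))
        img = TF.adjust_gamma(img, random.uniform(0.8, 1.2))
        out.append(img)
    return out[0], out[1]
```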

The data-augmented binocular images and sparse depth image are fed as input into the depth recovery network of step (1), which is trained end to end with the Adam optimizer. The L1 loss function evaluates the loss between the recovered depth map and the ground-truth depth, and iterative training follows the usual forward- and back-propagation procedure of a neural network. Training starts from a fixed initial learning rate and runs for 20 epochs in total; at the 16th and the 18th epoch the learning rate is reduced to half of its previous value. The learning rate and iteration parameters can be adjusted according to the actual depth recovery accuracy.
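A minimal PyTorch training-loop sketch matching this description (Adam optimizer, L1 loss, learning rate halved at the 16th and 18th epoch) is given below; the model, the data loader and the initial learning rate of 1e-3 are placeholders, the last because the original value appears only as a formula image.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=20, base_lr=1e-3, device='cuda'):
    """End-to-end training with Adam and an L1 loss; the learning rate is
    halved at epochs 16 and 18. base_lr is a placeholder value."""
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=base_lr)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[16, 18], gamma=0.5)
    for epoch in range(epochs):
        for left, right, sparse, gt in loader:        # one augmented training batch
            left, right, sparse, gt = (t.to(device) for t in (left, right, sparse, gt))
            pred = model(left, right, sparse)         # dense depth prediction
            mask = gt > 0                             # supervise only valid ground truth
            loss = F.l1_loss(pred[mask], gt[mask])
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
```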

(3) In the task verification process, as shown in Figure 4, the binocular image to be tested (Figure 4a and Figure 4b) and the sparse point cloud data are input into the depth recovery network trained in step (2). Using the sensor calibration parameters, the sparse point cloud data are projected into the left camera coordinate system to generate a sparse depth image (Figure 4c) as input, and the dense depth image (Figure 4d) is finally output, completing the visualization.

Corresponding to the above embodiment of the binocular and point cloud fusion depth recovery method, the present invention also provides an embodiment of a binocular and point cloud fusion depth recovery device.

Referring to Figure 5, a binocular and point cloud fusion depth recovery device provided by an embodiment of the present invention includes one or more processors for implementing the binocular and point cloud fusion depth recovery method of the above embodiment.

The embodiment of the binocular and point cloud fusion depth recovery device of the present invention can be applied to any device with data processing capability, such as a computer. The device embodiment can be implemented by software, by hardware, or by a combination of the two. Taking software implementation as an example, the device in the logical sense is formed by the processor of the device on which it runs reading the corresponding computer program instructions from non-volatile memory into memory and executing them. At the hardware level, Figure 5 shows a hardware structure diagram of a device with data processing capability on which the binocular and point cloud fusion depth recovery device of the present invention resides; besides the processor, memory, network interface and non-volatile memory shown in Figure 5, such a device may also include other hardware according to its actual function, which is not described further here.

The implementation of the functions and effects of the units in the above device is described in detail in the implementation of the corresponding steps of the above method and is not repeated here.

Since the device embodiment basically corresponds to the method embodiment, reference may be made to the description of the method embodiment for the relevant parts. The device embodiment described above is only illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention, and those of ordinary skill in the art can understand and implement it without inventive effort.

An embodiment of the present invention further provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the binocular and point cloud fusion depth recovery method of the above embodiment is implemented.

The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the preceding embodiments, such as a hard disk or memory. It may also be an external storage device, such as a plug-in hard disk, smart media card (SMC), SD card or flash card provided on the device. Furthermore, the computer-readable storage medium may include both an internal storage unit and an external storage device. It is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been or will be output.

The above embodiments are only used to illustrate the design ideas and features of the present invention, so that those skilled in the art can understand the content of the present invention and implement it accordingly; the scope of protection of the present invention is not limited to the above embodiments. Therefore, any equivalent change or modification made according to the principles and design ideas disclosed by the present invention falls within the scope of protection of the present invention.

Claims (10)

1. A binocular and point cloud fusion depth recovery method is characterized by comprising the following steps:
(1) Constructing a depth recovery network, wherein the depth recovery network comprises a sparse extension module, a multi-scale feature extraction and fusion module, a variable weight Gaussian modulation module and a cascaded three-dimensional convolution neural network module; the input of the depth recovery network is a binocular image and sparse point cloud data, and the output of the depth recovery network is a dense depth image;
(2) Training the depth recovery network constructed in the step (1), inputting a binocular image and sparse point cloud data by using a binocular data set, projecting the sparse point cloud data to a left eye camera coordinate system to generate a sparse depth map, comparing a depth truth value image, performing data enhancement on the binocular image and the sparse depth map, calculating a loss value of an output dense depth image, and iteratively updating network weights by using a back propagation network;
(3) Inputting the binocular image and the sparse point cloud data to be tested into the depth recovery network obtained by training in the step (2), utilizing sensor calibration parameters, projecting the sparse point cloud data to a left eye camera coordinate system to generate a sparse depth image, and outputting the dense depth image.
2. The binocular and point cloud fusion depth restoration method according to claim 1, wherein the sparse extension module is specifically: taking the multi-channel information of the image as a guide, improving the density of the sparse point cloud data by a neighborhood expansion method, and outputting a semi-dense depth map.
3. The binocular and point cloud fusion depth restoration method according to claim 2, wherein constructing the sparse extension module comprises the sub-steps of:
(a1) Acquiring a sparse depth map according to the pose relationship between the point cloud data and the left eye camera image, and respectively extracting pixel coordinates of effective points in the sparse depth map, corresponding image multichannel values and image multichannel values of points in the neighborhood of the effective points;
(a2) Calculating the average image numerical value deviation according to the image multi-channel numerical value corresponding to the pixel coordinate of the effective point and the image multi-channel numerical value of the point in the neighborhood;
(a3) Expanding the sparse depth map into a semi-dense depth map according to the average image numerical deviation of the effective points and a set fixed threshold, and outputting the semi-dense depth map.
4. The binocular and point cloud fusion depth restoration method according to claim 1, wherein the multi-scale feature extraction and fusion module specifically comprises: taking a semi-dense depth map and a binocular image output by a sparse extension module as input, adopting a Unet encoder decoder structure and combining a space pyramid pooling method to extract point cloud features, left eye image features and right eye image features, and further fusing the left eye image features and the point cloud features in a cascade mode on a feature layer to obtain fused features.
5. The binocular and point cloud fusion depth restoration method according to claim 4, wherein constructing the multi-scale feature extraction and fusion module comprises the sub-steps of:
(b1) Respectively carrying out multi-layer down-sampling coding on the semi-dense depth map and the binocular image output by the sparse extension module so as to obtain a plurality of scales of down-sampling coded left eye image features, right eye image features and point cloud features;
(b2) Respectively carrying out spatial pyramid pooling on the left eye image feature, the right eye image feature and the point cloud feature after the down-sampling coding with the lowest resolution ratio so as to obtain a result after the pooling;
(b3) Performing multi-layer up-sampling decoding on the left eye image characteristic, the right eye image characteristic and the result of the pooling processing of the point cloud characteristic respectively to obtain the left eye image characteristic, the right eye image characteristic and the point cloud characteristic which are subjected to the up-sampling decoding in multiple scales;
(b4) Cascading the left eye image features and the point cloud features after the up-sampling decoding on feature dimensions to obtain fusion features of the left eye image features and the point cloud features.
6. The binocular and point cloud fusion depth restoration method according to claim 1, wherein the variable weight Gaussian modulation module is specifically: generating Gaussian modulation functions with different weights according to the data reliability of the semi-dense depth map, and modulating the depth dimensions of the cost volume at different pixel positions.
7. The binocular and point cloud fusion depth restoration method according to claim 6, wherein constructing the variable weight Gaussian modulation module comprises the sub-steps of:
(c1) Constructing a cost volume in a cascading mode according to the fusion characteristics and the right-eye image characteristics;
(c2) Respectively constructing Gaussian modulation functions with different weights according to the reliability of the sparse point cloud;
(c3) Modulating the cost volume according to the constructed Gaussian modulation function to obtain the modulated cost volume.
8. The binocular and point cloud fusion depth restoration method according to claim 1, wherein constructing the cascaded three-dimensional convolutional neural network module comprises the sub-steps of:
(d1) Performing cost volume fusion and cost volume aggregation on the low-resolution cost volume through a three-dimensional convolution neural network to obtain an aggregated cost volume;
(d2) Obtaining softmax values of all depth values on each pixel coordinate by adopting a softmax function so as to obtain a low-resolution depth map;
(d3) Performing up-sampling according to the low-resolution depth map to obtain a prediction result of the high-resolution depth map, and performing three cascaded iterations to obtain a dense depth map at the complete resolution.
9. A binocular and point cloud fused depth recovery device, comprising one or more processors, for implementing the binocular and point cloud fused depth recovery method of any one of claims 1 to 8.
10. A computer-readable storage medium, having stored thereon a program which, when being executed by a processor, is adapted to carry out the binocular and point cloud fusion depth restoration method according to any one of claims 1 to 8.
CN202310170221.2A 2023-02-27 2023-02-27 A binocular and point cloud fusion depth restoration method, device and medium Active CN115861401B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310170221.2A CN115861401B (en) 2023-02-27 2023-02-27 A binocular and point cloud fusion depth restoration method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310170221.2A CN115861401B (en) 2023-02-27 2023-02-27 A binocular and point cloud fusion depth restoration method, device and medium

Publications (2)

Publication Number Publication Date
CN115861401A true CN115861401A (en) 2023-03-28
CN115861401B CN115861401B (en) 2023-06-09

Family

ID=85659135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310170221.2A Active CN115861401B (en) 2023-02-27 2023-02-27 A binocular and point cloud fusion depth restoration method, device and medium

Country Status (1)

Country Link
CN (1) CN115861401B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346608A (en) * 2013-07-26 2015-02-11 株式会社理光 Sparse depth map densing method and device
CN109685842A (en) * 2018-12-14 2019-04-26 电子科技大学 A kind of thick densification method of sparse depth based on multiple dimensioned network
US10937178B1 (en) * 2019-05-09 2021-03-02 Zoox, Inc. Image-based depth data and bounding boxes
US10984543B1 (en) * 2019-05-09 2021-04-20 Zoox, Inc. Image-based depth data and relative depth data
CN110738731A (en) * 2019-10-16 2020-01-31 光沦科技(深圳)有限公司 3D reconstruction method and system for binocular vision
CN111028285A (en) * 2019-12-03 2020-04-17 浙江大学 Depth estimation method based on binocular vision and laser radar fusion
CN111563923A (en) * 2020-07-15 2020-08-21 浙江大华技术股份有限公司 Method for obtaining dense depth map and related device
CN112102472A (en) * 2020-09-01 2020-12-18 北京航空航天大学 Sparse three-dimensional point cloud densification method
CN112435325A (en) * 2020-09-29 2021-03-02 北京航空航天大学 VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method
CN114004754A (en) * 2021-09-13 2022-02-01 北京航空航天大学 Scene depth completion system and method based on deep learning
CN114519772A (en) * 2022-01-25 2022-05-20 武汉图科智能科技有限公司 Three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation
CN115512042A (en) * 2022-09-15 2022-12-23 网易(杭州)网络有限公司 Network training and scene reconstruction method, device, machine, system and equipment
CN115511759A (en) * 2022-09-23 2022-12-23 西北工业大学 A Point Cloud Image Depth Completion Method Based on Cascade Feature Interaction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YONG LUO等: "《Full Resolution Dense Depth Recovery by Fusing RGB Images and Sparse Depth》", 《2019 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (ROBIO)》 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118212337A (en) * 2024-05-21 2024-06-18 哈尔滨工业大学(威海) A new viewpoint rendering method for human body based on pixel-aligned 3D Gaussian point cloud representation

Also Published As

Publication number Publication date
CN115861401B (en) 2023-06-09

Similar Documents

Publication Publication Date Title
CN112396645B (en) Monocular image depth estimation method and system based on convolution residual learning
US11348270B2 (en) Method for stereo matching using end-to-end convolutional neural network
CN113658051A (en) A method and system for image dehazing based on recurrent generative adversarial network
CN110570522B (en) Multi-view three-dimensional reconstruction method
KR20210058683A (en) Depth image generation method and device
CN113850900B (en) Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction
CN116416376A (en) Three-dimensional hair reconstruction method, system, electronic equipment and storage medium
CN110517352A (en) A three-dimensional reconstruction method, storage medium, terminal and system of an object
CN115035235A (en) Three-dimensional reconstruction method and device
CN117173229A (en) Monocular image depth estimation method and system integrating contrast learning
CN115861401B (en) A binocular and point cloud fusion depth restoration method, device and medium
CN112270701B (en) Parallax prediction method, system and storage medium based on packet distance network
CN117132516A (en) A new view synthesis method based on convolutional neural radiation field
CN114565789A (en) Text detection method, system, device and medium based on set prediction
CN118839745A (en) Step-by-step depth completion method and terminal based on nerve radiation field
CN117274066A (en) Image synthesis model, method, device and storage medium
CN114241052B (en) New perspective image generation method and system for multi-object scenes based on layout diagram
US12086965B2 (en) Image reprojection and multi-image inpainting based on geometric depth parameters
CN117218278A (en) Reconstruction method, device, equipment and storage medium of three-dimensional model
CN115797542A (en) Three-dimensional medical image geometric modeling method with direct volume rendering effect
KR20220154782A (en) Alignment training of multiple images
KR102648938B1 (en) Method and apparatus for 3D image reconstruction based on few-shot neural radiance fields using geometric consistency
Shi Svdm: Single-view diffusion model for pseudo-stereo 3d object detection
CN115482341B (en) Method, electronic device, program product and medium for generating mirage image
US20240161391A1 (en) Relightable neural radiance field model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant