CN111325782A

CN111325782A - Unsupervised monocular view depth estimation method based on multi-scale unification

Info

Publication number: CN111325782A
Application number: CN202010099283.5A
Authority: CN
Inventors: 丁萌; 姜欣言; 曹云峰; 李旭; 张振振
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2020-02-18
Filing date: 2020-02-18
Publication date: 2020-06-23

Abstract

The invention belongs to the technical field of image processing, and discloses an unsupervised monocular view depth estimation method based on multi-scale unity, comprising the following steps: S1: performing pyramid multi-scale processing on input stereo image pairs; S2: constructing a network framework for encoding and decoding ; S3: transfer the features extracted in the encoding stage to the inverse convolutional neural network to achieve feature extraction of input images of different scales; S4: uniformly upsample the disparity maps of different scales to the original input size; S5: use the input original image Perform image reconstruction with the corresponding disparity map; S6: Constrain the accuracy of image reconstruction; S7: Use the gradient descent method to train the network model; S8: Fit the corresponding disparity map according to the input image and the pre-trained model. The design of the invention does not need to use real depth data to supervise network training, and easily obtained binocular images are used as training samples, which greatly reduces the acquisition difficulty of network training and solves the problem of depth map holes caused by low-scale disparity map blurring.

Description

An unsupervised monocular view depth estimation method based on multi-scale unification

技术领域technical field

本发明涉及图像处理技术领域，尤其涉及一种基于多尺度统一的无监督单目视图深度估计方法。The invention relates to the technical field of image processing, in particular to an unsupervised monocular view depth estimation method based on multi-scale unification.

背景技术Background technique

随着科技的发展，信息的爆炸式增长，人们对于图像场景的注意力慢慢由二维向三维转换，物体的三维信息在日常生活中起到了极大的便利，其中，三维信息应用最广泛的莫过于驾驶场景的辅助驾驶系统。由于图像中包含丰富的信息，视觉传感器几乎覆盖了驾驶所需的所有相关信息，包括但不限于车道几何形状、交通标志、灯光，物体位置和速度等。在所有形式的视觉信息中，深度信息在驾驶辅助系统中起着非常重要的作用。例如，防撞系统通过计算障碍物与车辆之间的深度信息来发出碰撞警告。当行人与车辆之间的距离过小时，行人保护系统将自动采取措施使车辆减速。因此，只有获取到当前车辆与驾驶场景中其他交通参与者之间的深度信息，驾驶辅助系统才能准确地获得与外部环境的连接，从而使预警子系统能够正常工作。With the development of science and technology and the explosive growth of information, people's attention to image scenes is gradually shifting from two-dimensional to three-dimensional, and the three-dimensional information of objects has played a great role in daily life. Among them, three-dimensional information is the most widely used The most important thing is the assisted driving system in the driving scene. Due to the wealth of information contained in images, vision sensors cover almost all relevant information needed for driving, including but not limited to lane geometry, traffic signs, lights, object positions and speeds, etc. Among all forms of visual information, depth information plays a very important role in driver assistance systems. For example, collision avoidance systems issue collision warnings by calculating the depth information between the obstacle and the vehicle. When the distance between the pedestrian and the vehicle is too small, the pedestrian protection system will automatically take measures to slow down the vehicle. Therefore, only by obtaining the depth information between the current vehicle and other traffic participants in the driving scene, the driving assistance system can accurately obtain the connection with the external environment, so that the early warning subsystem can work normally.

目前市面上出现了许多可以获取深度信息的传感器，比如Sick公司的激光雷达。激光雷达可以生成稀疏的三维点云数据，但是它的缺点在于成本高昂，且使用场景十分有限，因此人们将目光转向从图像中恢复场景的三维结构信息。At present, there are many sensors on the market that can obtain depth information, such as Sick's LiDAR. Lidar can generate sparse 3D point cloud data, but its disadvantages are high cost and limited use scenarios, so people turn their attention to recovering 3D structural information of scenes from images.

传统的基于图像的深度估计所采用的方法，多是基于拍摄环境假设的几何约束和手工特征，应用较为广泛的方法如从运动中恢复结构，这种方法的优点在于实现的成本低，对拍摄环境要求不高、便于操作，但该方法的缺点是极易受到特征提取与匹配误差的影响，且只能获得较为稀疏的深度数据。The traditional image-based depth estimation methods are mostly based on geometric constraints and manual features assumed by the shooting environment, and widely used methods such as recovering structures from motion. The environment requirements are not high and it is easy to operate, but the disadvantage of this method is that it is easily affected by feature extraction and matching errors, and only relatively sparse depth data can be obtained.

随着卷积神经网络在其他视觉任务上大放异彩，许多研究人员开始探索深度学习方法在单目图像深度估计上的应用。人们利用神经网络的强大学习能力，设计各种模型来充分挖掘原图与深度图之间的联系，从而训练出可以根据输入图像预测场景深度，但是正如上文已经说过，场景的真实深度信息十分难得，也就意味着我们需要脱离场景的真实深度标签，采用无监督的方法来完成深度估计任务。其中一种无监督的方法是使用单目视频的时序信息作为监督信号，但是此类无监督的深度估计方法，由于采用的是运动过程中采集的视频信息，所以相机本身存在运动，图像序列之间相机的相对姿态是未知的，这就导致除了深度估计的网络之外，该方法还需要另外训练一个姿态估计网络，这无疑增加了原本就复杂的深度估计任务的难度。另外，由于单目视频的尺度不确定性，该方法只能得到相对的深度结果，即只能获得图像中各像素之间的相对远近，而无法获得图像中物体到相机的距离。另外，无监督深度估计的方法存在由于低尺度特征图细节模糊而导致得深度图纹理缺失乃至空洞的情况，直接影响了深度估计的精度。As Convolutional Neural Networks shined on other vision tasks, many researchers began to explore the application of deep learning methods for depth estimation in monocular images. People use the powerful learning ability of neural networks to design various models to fully exploit the connection between the original image and the depth map, so as to train the scene depth that can be predicted based on the input image, but as mentioned above, the real depth information of the scene It is very rare, which means that we need to break away from the real depth label of the scene and use an unsupervised method to complete the depth estimation task. One of the unsupervised methods is to use the time series information of the monocular video as the supervision signal, but this kind of unsupervised depth estimation method uses the video information collected during the motion process, so the camera itself has motion, and the image sequence The relative poses of the cameras are unknown, which leads to the need to train a pose estimation network in addition to the depth estimation network, which undoubtedly increases the difficulty of the already complex depth estimation task. In addition, due to the scale uncertainty of monocular video, this method can only obtain relative depth results, that is, it can only obtain the relative distance between pixels in the image, but cannot obtain the distance from the object in the image to the camera. In addition, the unsupervised depth estimation method has the situation that the depth map texture is missing or even hollow due to the blurred details of low-scale feature maps, which directly affects the accuracy of depth estimation.

发明内容SUMMARY OF THE INVENTION

本发明的目的是为了解决现有技术的缺点，而提出的一种基于多尺度统一的无监督单目视图深度估计方法。The purpose of the present invention is to solve the shortcomings of the prior art, and propose an unsupervised monocular view depth estimation method based on multi-scale unity.

为了实现上述目的，本发明采用了如下技术方案：In order to achieve the above object, the present invention adopts the following technical solutions:

一种基于多尺度统一的无监督单目视图深度估计方法，包括以下步骤：An unsupervised monocular view depth estimation method based on multi-scale unity, including the following steps:

步骤S1：对输入立体图像对进行金字塔多尺度处理，以此进行多个尺度的特征提取；Step S1: perform pyramid multi-scale processing on the input stereo image pair, so as to perform feature extraction on multiple scales;

步骤S2：构建编码解码的网络框架，获得可用于获取深度图的视差图；Step S2: constructing a network framework for encoding and decoding to obtain a disparity map that can be used to obtain a depth map;

步骤S3：将在编码阶段提取的特征输送至反向卷积神经网络实现不同尺度输入图像的特征提取，在解码阶段拟合不同尺度输入图像的视差图；Step S3: delivering the features extracted in the encoding stage to the inverse convolutional neural network to achieve feature extraction of input images of different scales, and fitting disparity maps of input images of different scales in the decoding stage;

步骤S4：将不同尺度的视差图统一上采样至原输入尺寸；Step S4: uniformly up-sampling the disparity maps of different scales to the original input size;

步骤S5：使用输入立体图像原图与对应的视差图进行图像重建；Step S5: using the original image of the input stereoscopic image and the corresponding disparity map to perform image reconstruction;

步骤S6：通过外观匹配损失，左右视差转换损失，以及视差平滑损失来约束图像重建的准确性；Step S6: Constrain the accuracy of image reconstruction through appearance matching loss, left-right disparity conversion loss, and disparity smoothing loss;

步骤S7：使用最小化损失的思想，采用梯度下降法训练网络模型；Step S7: Using the idea of minimizing loss, the gradient descent method is used to train the network model;

步骤S8：在测试阶段，根据输入图像与预训练模型拟合对应的视差图；利用双目成像的三角测量原理，由视差图计算得到对应场景深度图。Step S8: In the testing phase, fit a corresponding disparity map according to the input image and the pre-trained model; use the triangulation principle of binocular imaging to calculate the corresponding scene depth map from the disparity map.

优选的，所述步骤S1中，将输入图像下采样至原图像的1，1/2，1/4，1/8 此四个尺寸，形成金字塔输入结构，随后送至编码模型中进行特征提取。Preferably, in the step S1, the input image is down-sampled to four sizes of 1, 1/2, 1/4, and 1/8 of the original image to form a pyramid input structure, and then sent to the encoding model for feature extraction .

优选的，所述步骤S2中，采用ResNet-101网络结构作为编码阶段的网络模型，ResNet网络结构采用残差设计，在网络加深的同时，减少信息丢失。Preferably, in the step S2, the ResNet-101 network structure is used as the network model in the coding stage, and the ResNet network structure adopts the residual design, which reduces information loss while the network is deepened.

优选的，所述步骤S3中，在编码阶段对不同尺度的输入图像进行特征提取，将提取的特征输送至解码阶段的反向卷积神经网络实现视差图拟合，具体为：Preferably, in the step S3, feature extraction is performed on input images of different scales in the encoding stage, and the extracted features are sent to the inverse convolutional neural network in the decoding stage to implement disparity map fitting, specifically:

步骤S41：在编码阶段对金字塔结构的输入图像分别通过ResNet-101网络进行特征提取，并在提取过程中相对于不同尺寸的输入图像缩小至1/16，获得尺寸为原输入图像1/16，1/32，1/64，1/128的特征；Step S41: In the encoding stage, the input images of the pyramid structure are respectively subjected to feature extraction through the ResNet-101 network, and in the extraction process, the input images of different sizes are reduced to 1/16, and the obtained size is 1/16 of the original input image, 1/32, 1/64, 1/128 characteristics;

步骤S42：将编码阶段获得的四个尺寸的特征输入到解码阶段的网络中，在此过程中逐层对输入特征进行反卷积，使其恢复至原输入图像1，1/2，1/4，1/8 尺寸的金字塔结构，根据输入特征与反卷积网络分别拟合此4个尺寸图像的视差图；Step S42: Input the four-dimensional features obtained in the encoding stage into the network in the decoding stage, and deconvolve the input features layer by layer in the process to restore them to the original input image 1, 1/2, 1/ 4, 1/8 size pyramid structure, fit the disparity map of the 4 size images according to the input features and the deconvolution network;

优选的，所述步骤S4中，将尺寸为原输入图像的1，1/2，1/4，1/8的视差图统一上采样至原输入图像的尺寸。Preferably, in the step S4, the disparity maps whose sizes are 1, 1/2, 1/4, and 1/8 of the original input image are uniformly up-sampled to the size of the original input image.

优选的，所述步骤S5中，由于4个尺寸的视差图统一上采样至原输入尺寸，使用原本输入的左图I^l与右视差图d^r重建出右图

原右图I^r与左视差图d^l重建出左图

Preferably, in the step S5, since the disparity maps of 4 sizes are uniformly upsampled to the original input size, the right ^image is reconstructed using the originally input left image I1 and right disparity image d ^r

The left image is reconstructed from the original right image I ^r and the left disparity image d ^l

优选的，所述步骤S6中，利用原输入的左右视图以及重构所得左右视图计算损失约束图像重建的准确性；Preferably, in the step S6, the original input left and right views and the reconstructed left and right views are used to calculate the loss to constrain the accuracy of image reconstruction;

采用梯度下降法最小化损失函数，以此训练图像重建网络，具体为：The gradient descent method is used to minimize the loss function to train the image reconstruction network, which is as follows:

步骤S71：损失函数由三部分构成，分别为外观匹配损失外观损失C_a，平滑损失C_s和视差转换损失C_t；对于每一项损失，左图和右图的计算方式相同，最终的损失函数由三项组合而成：Step S71: The loss function is composed of three parts, which are the appearance matching loss C _a , the smoothing loss C _s and the disparity conversion loss C _t ; for each loss, the calculation methods of the left and right images are the same, and the final loss A function is composed of three items:

步骤S72：在原输入尺寸上的不同视差图与输入原图分别计算损失，得到对应4个损失C_i，i＝1,2,3,4，总的损失函数为

Step S72: Calculate the losses on the different disparity maps of the original input size and the input original image respectively, and obtain corresponding 4 losses C _i , i=1, 2, 3, 4, and the total loss function is

优选的，所述步骤S7中，使用最小化损失的思想，采用梯度下降法训练网络模型。Preferably, in the step S7, the idea of minimizing the loss is used, and the gradient descent method is used to train the network model.

优选的，所述步骤S8中，在测试阶段，使用输入的单张图像以及预训练模型拟合输入图像对应的视差图，根据双目成像的三角测量原理，利用视差图生成对应的深度图像，具体为：Preferably, in the step S8, in the testing stage, the input single image and the pre-trained model are used to fit the disparity map corresponding to the input image, and the corresponding depth image is generated by using the disparity map according to the triangulation principle of binocular imaging, Specifically:

其中，(i,j)为图像中任一点的像素级坐标，D(i,j)为该点的深度值，d(i,j)为该点的视差值，b为已知的两个相机之间的距离，f为相机已知焦距。Among them, (i,j) is the pixel-level coordinate of any point in the image, D(i,j) is the depth value of the point, d(i,j) is the disparity value of the point, and b is the known two The distance between the two cameras, f is the known focal length of the camera.

本发明所述的基于多尺度统一的无监督单目视图深度估计方法，在通常的深度学习方法在解决深度估计问题时，需要输入图像对应的真实深度图像，但该真实深度数据获取代价昂贵，且只能获得稀疏的点云深度，无法完全满足应用需求；在这样的情况下，本文采用图像重建损失来监督模型的训练过程，使用相对易获取双目图像代替真实深度来进行训练，从而解决实现无监督的深度估计；The multi-scale unified unsupervised monocular view depth estimation method of the present invention requires a real depth image corresponding to the input image when the common deep learning method solves the depth estimation problem, but the acquisition of the real depth data is expensive, And only the sparse point cloud depth can be obtained, which cannot fully meet the application requirements; in this case, this paper adopts the image reconstruction loss to supervise the training process of the model, and uses the relatively easy-to-obtain binocular image instead of the real depth for training, so as to solve the problem. achieve unsupervised depth estimation;

本发明所述的基于多尺度统一的无监督单目视图深度估计方法，通过在编码阶段对输入立体图像对进行金字塔多尺度处理，减小不同尺寸目标对深度估计的影响；The multi-scale unified unsupervised monocular view depth estimation method of the present invention reduces the influence of objects of different sizes on the depth estimation by performing pyramid multi-scale processing on the input stereo image pair in the encoding stage;

本发明所述的基于多尺度统一的无监督单目视图深度估计方法，在针对低尺度深度图模糊的情况，将所有的视差图统一采样至原输入尺寸，并在此尺寸上进行图像的重建以及损失的计算，改善了深度图空洞的问题；In the unsupervised monocular view depth estimation method based on multi-scale unity, in the case of low-scale depth map blur, all disparity maps are uniformly sampled to the original input size, and the image is reconstructed on this size And the calculation of the loss, which improves the problem of holes in the depth map;

本发明设计合理，无需利用真实深度数据监督网络训练，以容易获取的双目图像作为训练样本，大大降低了网络训练的获取难度，同时也解决了由于低尺度视差图模糊带来的深度图空洞的问题。The invention has a reasonable design, does not need to use real depth data to supervise network training, and uses easily obtained binocular images as training samples, which greatly reduces the acquisition difficulty of network training, and also solves the depth map hole caused by the blurring of low-scale disparity maps. The problem.

附图说明Description of drawings

图1为本发明提出的一种基于多尺度统一的无监督单目视图深度估计方法流程图；1 is a flowchart of a method for depth estimation of unsupervised monocular view based on multi-scale unity proposed by the present invention;

图2为本发明提出的一种基于多尺度统一的无监督单目视图深度估计方法的网络模型结构图；2 is a network model structure diagram of a multi-scale unified unsupervised monocular view depth estimation method proposed by the present invention;

图3为本发明提出的一种基于多尺度统一的无监督单目视图深度估计方法的网络结构瓶颈模块示意图；3 is a schematic diagram of a network structure bottleneck module of a multi-scale unified unsupervised monocular view depth estimation method proposed by the present invention;

图4为本发明提出的一种基于多尺度统一的无监督单目视图深度估计方法的尺度统一示意图；4 is a schematic diagram of scale unification of an unsupervised monocular view depth estimation method based on multi-scale unification proposed by the present invention;

图5为本发明提出的一种基于多尺度统一的无监督单目视图深度估计方法在经典驾驶数据集KITTI上的估计结果图，(a)为输入图像，(b)为深度估计结果图；5 is an estimation result diagram of a multi-scale unified unsupervised monocular view depth estimation method proposed by the present invention on the classic driving data set KITTI, (a) is an input image, (b) is a depth estimation result diagram;

图6为本发明提出的一种基于多尺度统一的无监督单目视图深度估计方法在道路场景实时拍摄图片上的泛化效果图，(a)为输入图像，(b)为深度估计结果图。6 is a generalization effect diagram of a multi-scale unified unsupervised monocular view depth estimation method proposed by the present invention on real-time captured pictures of road scenes, (a) is the input image, (b) is the depth estimation result diagram .

具体实施方式Detailed ways

下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments.

参照图1-6，一种基于多尺度统一的无监督单目视图深度估计方法，其中的无监督深度单目深度估计网络模型在本实验室台式工作站上进行，显卡采用 NVIDIA GeForceGTX 1080Ti，训练系统为Ubuntu14.04，采用TensorFlow 1.4.0 作为框架搭建平台；在经典的驾驶数据集KITTI 2015立体数据集上进行训练。Referring to Figure 1-6, an unsupervised monocular view depth estimation method based on multi-scale unity, the unsupervised depth monocular depth estimation network model is carried out on the desktop workstation of this laboratory, the graphics card adopts NVIDIA GeForceGTX 1080Ti, and the training system For Ubuntu14.04, use TensorFlow 1.4.0 as the framework to build the platform; train on the classic driving dataset KITTI 2015 stereo dataset.

如图1所示，本发明的一种基于多尺度统一的无监督单目视图深度估计方法，具体包括以下步骤：As shown in FIG. 1 , an unsupervised monocular view depth estimation method based on multi-scale unity of the present invention specifically includes the following steps:

步骤S1：采用经典驾驶KITTI中的双目数据集作为训练集，设置尺度参数为4，将图像下采样至输入图像的1/2，1/4，1/8，加上原图共有4个尺寸的输入图像，形成金字塔结构，随后送至ResNet-101神经网络模型中进行特征提取；Step S1: Use the binocular data set in the classic driving KITTI as the training set, set the scale parameter to 4, downsample the image to 1/2, 1/4, and 1/8 of the input image, plus the original image has a total of 4 sizes The input image forms a pyramid structure, which is then sent to the ResNet-101 neural network model for feature extraction;

步骤S2：构建编码解码的网络框架，获得可用于获取深度图的视差图；具体过程为：Step S2: constructing a network framework for encoding and decoding to obtain a disparity map that can be used to obtain a depth map; the specific process is:

采用ResNet-101网络结构作为编码阶段的网络模型，ResNet网络结构采用残差设计，在网络加深的同时，减少信息丢失；ResNet网络中的残差结构如图 3(a)所示，先使用1×1的卷积降低特征维度，之后再通过1×1的卷积恢复，则参数量为：The ResNet-101 network structure is used as the network model in the coding stage. The ResNet network structure adopts the residual design, which reduces the loss of information while the network is deepened; the residual structure in the ResNet network is shown in Figure 3(a), first use 1 The convolution of ×1 reduces the feature dimension, and then restores it through the convolution of 1 × 1, the parameter quantity is:

1×1×256×64+3×3×64×64+1×1×64×256＝696321×1×256×64+3×3×64×64+1×1×64×256=69632

而通常的ResNet模块如图3(b)所示，参数量为：The usual ResNet module is shown in Figure 3(b), and the parameters are:

3×3×256×256×2＝11796483×3×256×256×2=1179648

由此可见，使用带有瓶颈结构的残差模块可以大大的降低参数量；It can be seen that using the residual module with a bottleneck structure can greatly reduce the amount of parameters;

步骤S3：将在编码阶段提取的特征输送至反向卷积神经网络实现不同尺度输入图像的特征提取，在解码阶段拟合不同尺度输入图像的视差图，具体为：Step S3: The features extracted in the encoding stage are sent to the inverse convolutional neural network to achieve feature extraction of input images of different scales, and the disparity maps of input images of different scales are fitted in the decoding stage, specifically:

步骤S31：网络解码过程中，为确保反卷积神经网络中特征图的尺寸与 ResNet-101残差网络特征图尺寸相对应，本网络使用跳跃连接将ResNet-101 编码过程中的部分特征图直接连接到反卷积神经网络中；Step S31: During the network decoding process, in order to ensure that the size of the feature map in the deconvolutional neural network corresponds to the size of the ResNet-101 residual network feature map, the network uses skip connections to directly encode some feature maps in the ResNet-101 encoding process. connected to a deconvolutional neural network;

步骤S32：在编码阶段对金字塔结构的输入图像分别通过ResNet-101网络进行特征提取，并在提取过程中相对于不同尺寸的输入图像缩小至1/16，获得尺寸为原输入图像1/16，1/32，1/64，1/128的特征；Step S32: In the encoding stage, the input image of the pyramid structure is extracted by the ResNet-101 network, and in the extraction process, the input image of different sizes is reduced to 1/16, and the obtained size is 1/16 of the original input image, 1/32, 1/64, 1/128 characteristics;

步骤S33：将编码阶段获得的四个尺寸的特征输入到解码阶段的网络中，在此过程中逐层对输入特征进行反卷积，使其恢复至原输入图像1，1/2，1/4，1/8 尺寸的金字塔结构，根据输入特征与反卷积网络分别拟合此4个尺寸图像的近似视差图；Step S33: Input the four-dimensional features obtained in the encoding stage into the network in the decoding stage, and deconvolve the input features layer by layer in this process to restore them to the original input image 1, 1/2, 1/ 4. The pyramid structure of 1/8 size, according to the input feature and the deconvolution network, respectively fit the approximate disparity map of the 4 size images;

步骤S4：将尺寸为原输入图像的1，1/2，1/4，1/8的视差图统一上采样至原输入图像的尺寸；Step S4: uniformly up-sampling the disparity maps whose sizes are 1, 1/2, 1/4, and 1/8 of the original input image to the size of the original input image;

步骤S5：使用输入的原图与对应的视差图进行图像重建，利用视差图和其对应的左视图，重构出右视图，再使用原右图与左视差图重建出左图，最后将重建出的左图右图分别与输入的左右原图进行比较；Step S5: use the input original image and the corresponding disparity map to reconstruct the image, use the disparity map and its corresponding left view to reconstruct the right view, then use the original right image and the left disparity map to reconstruct the left image, and finally reconstruct the image. The output left and right images are compared with the input left and right original images respectively;

步骤S6：再利用外观匹配损失，左右视差转换损失，以及视差平滑损失来约束图像合成的准确性；具体为：Step S6: Reuse appearance matching loss, left and right parallax conversion loss, and parallax smoothing loss to constrain the accuracy of image synthesis; specifically:

步骤S61：损失函数由三部分构成，分别为外观匹配损失C_a，平滑损失C_s和视差转换损失C_t；Step S61: the loss function consists of three parts, which are the appearance matching loss C _a , the smoothing loss C _s and the disparity conversion loss C _t ;

在图像重建过程中，首先要使用外观匹配损失C_a，逐像素地判断重建出的图像与对应输入图像之间的准确度，该损失由结构相似度指标和L₁损失共同组成，以输入左图为例：In the image reconstruction process, the appearance matching loss C _a is first used to judge the accuracy between the reconstructed image and the corresponding input image pixel by pixel. The loss is composed of the structural similarity index and L ₁ loss. Figure as an example:

其中S为结构相似度指标，由亮度测量、对比度测量和结构对比三个部分组成，用于衡量两幅图像之间的相似度，两幅图像越相似，则相似度指标值越高； L₁损失为最小化绝对误差损失，用于逐像素地比较两幅图像之间的差距，相对与L₂损失来说对异常点更加不敏感；α为结构相似度在外观匹配损失中的权重系数，N为图像中像素总数；Among them, S is the structural similarity index, which consists of three parts: brightness measurement, contrast measurement and structural comparison. It is used to measure the similarity between two images. The more similar the two images are, the higher the similarity index value is; L ₁ The loss is to minimize the absolute error loss, which is used to compare the difference between the two images pixel by pixel, and is less sensitive to outliers _than the L2 loss; α is the weight coefficient of the structural similarity in the appearance matching loss, N is the total number of pixels in the image;

其次，平滑损失α_s可以减轻局部梯度过大导致视差图不连续的情况，确保所成视差图的平滑性，以左图为例，具体公式如下：Secondly, the smoothing loss α _s can reduce the discontinuity of the disparity map caused by the excessive local gradient, and ensure the smoothness of the disparity map. Take the left image as an example, the specific formula is as follows:

视差转换损失C_t的目的是为了减小根据左图生成的右视差图，与根据右图生成的左视差图之间的转换误差，确保两视差图之间的一致性，具体公式如下：The purpose of the disparity conversion loss C _t is to reduce the conversion error between the right disparity map generated from the left image and the left disparity map generated from the right image, and to ensure the consistency between the two disparity maps. The specific formula is as follows:

对于每一项损失，左图和右图的计算方式相同，最终的损失函数由三项组合而成：For each loss, the left and right graphs are calculated in the same way, and the final loss function is composed of three terms:

其中α_a为外观匹配损失在总体损失中所占权重，α_s为平滑损失在总体损失中所占权重，α_t为转换损失在总体损失中所占权重；where α _a is the weight of the appearance matching loss in the overall loss, α _s is the weight of the smoothing loss in the overall loss, and α _t is the weight of the transformation loss in the overall loss;

步骤S62：在原输入尺寸上的不同视差图与输入原图分别计算损失，得到对应4个损失C_i，i＝1,2,3,4，总的损失函数为

Step S62: Calculate the losses on the different disparity maps on the original input size and the input original image respectively, and obtain four corresponding losses C _i , i=1, 2, 3, 4, and the total loss function is

步骤S7：使用最小化损失的思想，采用梯度下降法训练网络模型，具体为：在立体图像对的训练过程中，我们采用开源的TensorFlow 1.4.0平台搭建深度估计模型，使用具有立体图像对的KITTI数据集进行作为训练集,将该数据集中的29000对用于模型的训练；在训练时，设置初始学习率lr，并在40个周期之后，每10个周期将学习率变为当前学习率的一半，一共训练70个周期；同时设置批处理大小为bs，即一次处理bs张图片；使用Adam优化器进行模型的优化，并设置β₁，β₂控制权重系数移动均值的衰减率，最终在GTX 1080Ti实验平台上花费34个小时完成所有的训练；Step S7: Using the idea of minimizing loss, the gradient descent method is used to train the network model, specifically: in the training process of the stereo image pair, we use the open source TensorFlow 1.4.0 platform to build a depth estimation model, and use the The KITTI dataset is used as a training set, and the 29,000 pairs in the dataset are used for model training; during training, the initial learning rate lr is set, and after 40 cycles, the learning rate is changed to the current learning rate every 10 cycles half of , and a total of 70 cycles of training; at the same time, set the batch size to bs, that is, to process bs pictures at a time; use the Adam optimizer to optimize the model, and set β ₁ , β ₂ to control the decay rate of the weight coefficient moving average, and finally It took 34 hours to complete all training on the GTX 1080Ti experimental platform;

表1损失函数及训练参数Table 1 Loss function and training parameters

步骤S8：在测试阶段，根据输入图像与预训练模型拟合对应的视差图；利用双目成像的三角测量原理，由视差图计算得到对应场景深度图；在本实验采用的KITTI道路驾驶数据集中，相机的基线距离固定为0.54m，相机的焦距根据所摄相机型号不同而改变，不同的相机型号在KITTI数据集中体现为不同的图像尺寸，对应关系如下表：Step S8: In the test phase, fit the corresponding disparity map according to the input image and the pre-training model; use the triangulation principle of binocular imaging to calculate the depth map of the corresponding scene from the disparity map; in the KITTI road driving data set used in this experiment. , the baseline distance of the camera is fixed at 0.54m, and the focal length of the camera changes according to the different camera models. Different camera models are reflected in different image sizes in the KITTI dataset. The corresponding relationship is as follows:

则深度与视差的转换公式具体为：Then the conversion formula of depth and disparity is as follows:

其中，(i,j)为图像中任一点的像素坐标，D(i,j)为该点的深度值，d(i,j)为该点的视差值；Among them, (i, j) is the pixel coordinate of any point in the image, D(i, j) is the depth value of the point, and d(i, j) is the disparity value of the point;

由此，根据输入图像与利用双目图像重建原理预训练的网络模型，拟合出输入图像对应的视差图，根据已知的相机焦距和基线距离，即可计算出该相机所拍摄输入图像的对应场景深度图。Therefore, according to the input image and the pre-trained network model using the principle of binocular image reconstruction, the disparity map corresponding to the input image is fitted, and according to the known camera focal length and baseline distance, the input image captured by the camera can be calculated Corresponds to the scene depth map.

本发明使用到的标准零件均可以从市场上购买，异形件根据说明书的和附图的记载均可以进行订制，各个零件的具体连接方式均采用现有技术中成熟的螺栓、铆钉、焊接等常规手段，机械、零件和设备均采用现有技术中，常规的型号，加上电路连接采用现有技术中常规的连接方式，在此不再详述。The standard parts used in the present invention can be purchased from the market, the special-shaped parts can be customized according to the description in the specification and the drawings, and the specific connection methods of each part adopt mature bolts, rivets, welding, etc. in the prior art Conventional means, machinery, parts and equipment all adopt conventional models in the prior art, and circuit connections adopt conventional connection methods in the prior art, which will not be described in detail here.

Claims

1. An unsupervised monocular view depth estimation method based on multi-scale unification is characterized by comprising the following steps:

step S1: carrying out pyramid multi-scale processing on the input stereo image pair so as to extract features of multiple scales;

step S2: constructing a network framework of coding and decoding to obtain a disparity map which can be used for calculating a depth map;

step S3: the features extracted in the encoding stage are transmitted to a reverse convolution neural network to realize the feature extraction of the input images with different scales, and the disparity maps of the input images with different scales are fitted in the decoding stage;

step S4: uniformly up-sampling disparity maps of different scales to an original input size;

step S5: reconstructing an image by using the input original image and a corresponding disparity map;

step S6: the accuracy of image reconstruction is constrained through appearance matching loss, left-right parallax conversion loss and parallax smoothing loss;

step S7: training a network model by using a gradient descent method by using a loss minimization idea;

step S8: in the testing stage, fitting a corresponding disparity map according to an input image and a pre-training model; and calculating a corresponding scene depth map by using a binocular imaging triangulation principle and the disparity map.

2. The method as claimed in claim 1, wherein in step S1, the input image is down-sampled to four sizes of 1, 1/2, 1/4, 1/8 of the original image to form a pyramid input structure, and then sent to the coding model for feature extraction.

3. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S2, a ResNet-101 network structure is adopted as a network model in an encoding stage.

4. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S3, feature extraction is performed on input images of different scales in an encoding stage, and the extracted features are transmitted to a deconvolution neural network in a decoding stage to implement disparity map fitting, specifically:

step S41: respectively performing feature extraction on the input image with the pyramid structure through a ResNet-101 network in an encoding stage, and reducing the input image to 1/16 in the extraction process relative to input images with different sizes to obtain features of original input images 1/16, 1/32, 1/64 and 1/128;

step S42: inputting the features of four sizes obtained in the encoding stage into the network in the decoding stage, deconvoluting the input features layer by layer in the process to restore the input features to the pyramid structures of the original input images 1, 1/2, 1/4 and 1/8 sizes, and respectively fitting the disparity maps of the images of 4 sizes according to the input features and the deconvolution network.

5. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in the step S4, the disparity maps with the size of 1, 1/2, 1/4, 1/8 of the original input image are unified up-sampled to the size of the original input image.

6. The unsupervised monocular view depth based on multi-scale unification of claim 1The estimation method is characterized in that in the step S5, since the disparity maps of 4 sizes are uniformly up-sampled to the original input size, the originally input left map I is used^lAnd the right parallax image d^rReconstruct a right image

Original right picture I^rAnd left parallax map d^lReconstruct the left image

7. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S6, accuracy of image reconstruction is constrained by calculating loss using the original input left and right views and the reconstructed left and right views;

minimizing a loss function by adopting a gradient descent method, and training an image reconstruction network by adopting the method, specifically:

step S71: the loss function is composed of three parts, namely appearance matching loss and appearance loss C_aSmoothing loss C_sAnd parallax conversion loss C_t(ii) a For each term loss, the left and right graphs are computed in the same way, and the final loss function is composed of three terms:

step S72: respectively calculating losses on different parallax maps and the input original image on the original input size to obtain 4 losses C_iI is 1,2,3,4, the total loss function is

8. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S7, a network model is trained by using a gradient descent method using an idea of minimizing loss.

9. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in the step S8, in the testing stage, an input single image and a pre-trained model are used to fit a disparity map corresponding to the input image, and according to a principle of triangulation of binocular imaging, the disparity map is used to generate a corresponding depth image, specifically:

where (i, j) is the pixel-level coordinate of any point in the image, D (i, j) is the depth value of the point, D (i, j) is the parallax value of the point, b is the known distance between two cameras, and f is the known focal length of the camera.