Technical Field
Depth estimation has long been an active research direction in computer vision. The three-dimensional data provided by a depth map supplies the information required by applications such as three-dimensional reconstruction, Augmented Reality (AR), and intelligent navigation. Moreover, the positional relationships expressed by a depth map are important in many image tasks and can further simplify image processing algorithms. Current depth estimation methods fall mainly into two categories: monocular depth estimation and binocular depth estimation.
Monocular depth estimation uses only a single camera. In traditional algorithms the camera captures consecutive image frames, and projection transformation through an inter-frame motion model is used to estimate image depth. Deep-learning-based monocular depth estimation trains a deep neural network on a data set with ground-truth depth information and regresses depth with the learned network. Such methods require simple equipment, are low-cost, and suit dynamic scenes; however, owing to the lack of scale information, the estimated depth is usually not accurate enough, and performance often degrades severely in unknown scenes. Binocular methods use two calibrated cameras viewing the same object from two different perspectives: the same spatial point is located in both views, the disparity between corresponding pixels is computed, and the disparity is converted into depth by triangulation. Traditional binocular estimation relies on stereo matching algorithms, which are computationally heavy and perform poorly in low-texture scenes. Deep-learning-based binocular depth estimation mostly adopts supervised learning; thanks to the strong learning capacity of neural networks, both accuracy and speed are greatly improved over traditional methods.
However, supervised learning usually depends heavily on ground-truth values, which may suffer from error and noise, sparse depth information, and difficult calibration of hardware equipment, so the estimated depth is not accurate enough. Unsupervised learning has long been regarded as a direction in which artificial intelligence can truly and effectively learn by itself in the real world; hence, in recent years, image depth estimation based on unsupervised learning has become a research hotspot.
Disclosure of Invention
The invention aims to provide a binocular depth estimation method based on a deep neural network. It adopts unsupervised learning, uses only the left and right viewpoint images captured by a binocular camera as network input, and requires no prior depth information of the input images as training labels. Meanwhile, the adaptive design of the network treats the camera's intrinsic and extrinsic parameters as independent model parameters, so the method suits multiple camera systems without modifying the network. In addition, the neural network is largely unaffected by illumination, noise, and the like, giving high robustness.
The technical scheme for realizing the purpose of the invention is as follows:
A binocular depth estimation method based on a deep neural network comprises the following steps:
1) Perform image preprocessing such as cropping and transformation on the input left and right viewpoint images for data enhancement. The preprocessing includes mild affine deformation, random horizontal flipping, random scale jitter, and random contrast, brightness, saturation, and sharpness adjustment, which further increases the number of samples, facilitates training optimization of the network parameters, and enhances the generalization capability of the network;
2) Construct a multi-scale network model for binocular depth estimation; the model comprises multiple convolution layers, activation layers, residual connections, multi-scale pooling connections, linear up-sampling layers, and the like.
(a) The network adopts three residual network structures to perform multi-scale convolution on the input; each residual module comprises two convolution layers and an identity mapping. Except for the first convolution layer, whose kernel is 7 × 7, the remaining kernels are 3 × 3 in size.
(b) The second, sixth, and fourteenth layers of the network form a multi-scale pooling module. Average pooling is applied to the outputs of the second and sixth layers, with a stride of 4 and kernel size 4 × 4, and a stride of 2 and kernel size 2 × 2, respectively; the results are then convolved by 1 × 1 together with the output of the fourteenth layer.
(c) The left and right views are processed by the front-end network; after the multi-scale pooling module, a feature correlation operation associates the feature information of the two views, computing the feature correlation between them:
c(x1, x2) = ∑_{o ∈ [−k, k] × [−k, k]} ⟨f_l(x1 + o), f_r(x2 + o)⟩
where c is the correlation between the image block centered at x1 in the left feature map and the image block centered at x2 in the right feature map, f_l is the left feature map, f_r is the right feature map, and each image block has size (2k + 1) × (2k + 1).
(d) The network then restores the original image resolution from the correlation features, obtaining depth maps at different scales via deconvolution, up-sampling, and the like. In the linear up-sampling operation, bilinear interpolation is applied to the output of the previous layer, and skip-layer connections with the corresponding down-sampling layers are made through residual learning, finally restoring the image to its original size.
3) Set initialization parameters according to the designed network model, and design a loss function whose value is minimized during training to obtain the optimal network weights.
The pixel values of the left and right views input to the network are denoted I_l and I_r, respectively. When the network obtains the predicted depth map D̂_l of the left image, the inverse intrinsic matrix K⁻¹ is used to transform I_r from the image coordinate system into the camera coordinate system; the extrinsic matrix T transforms the result into the camera coordinate system of the left image; and the intrinsic matrix K transforms it back into the image coordinate system of the left image, yielding a transition image Î_l. The specific formula is as follows:
p_r ~ K T D̂_l(p_l) K⁻¹ p_l
where p_r is the corresponding image pixel coordinate in the right view. Because the projection transformation produces continuous pixel coordinates in the transition image, the pixel value at each coordinate is determined by 4-neighborhood interpolation, finally yielding the target image Ĩ_l, where each weight w_ij is inversely proportional to the spatial distance between the projected point and the corresponding neighboring pixel, and ∑_{i,j} w_ij = 1.
The reconstruction loss function is constructed by applying the Huber loss function to the difference between the target image and the original input image.
4) Input the image to be processed into the network model to obtain the corresponding depth map, and continuously repeat the above steps until the network converges or the set number of training iterations is reached.
The invention provides a deep neural network based on unsupervised learning that trains the network model on left and right images without ground-truth depth information to obtain a monocular depth map. The invention exploits the multiple viewing angles of a binocular camera and realizes the mapping from binocular image input to monocular depth map output using a representation learning method with multi-layer representations, namely a convolutional neural network. The network model obtains receptive fields at different scales through multi-layer down-sampling, extracts features of the input image with residual structures, and strengthens local texture detail with a multi-scale pooling module, thereby improving the accuracy and robustness of the model. The up-sampling layers use bilinear interpolation, and residual structures are reused to learn information across the up-sampling layers, reducing information loss while restoring the image size and further ensuring the accuracy of the depth estimation.
The advantages and beneficial effects of the invention are as follows:
1. The binocular depth estimation method based on a deep neural network rests on unsupervised learning, and the strong learning capacity of the deep convolutional network ensures the accuracy of the predicted depth values.
2. The invention uses residual connections for feature extraction multiple times and completes multi-scale information fusion with skip-layer connections during up-sampling, which reduces the information loss of conventional convolution during transmission, preserves information integrity, and greatly improves network convergence speed.
3. The method obtains images at different scales through repeated down-sampling and obtains different receptive fields through the multi-scale pooling module to strengthen local texture details.
4. The feature correlation operation in the network correlates the features of the left and right views, is not easily affected by noise, and improves the robustness of the network model.
5. The network's input images carry no ground-truth depth information. The network computes a target image from the predicted depth map, the camera parameters, and the original input, and constructs a loss function from the difference between the target image and the original input to optimize the network parameters, so the whole network is trained in an unsupervised manner.
6. The camera's parameter information is set outside the network as part of the network parameters, so the model suits various camera systems with different configurations and has strong adaptive capability.
Detailed Description
The present invention will be described in further detail with reference to the following embodiments, which are illustrative only and not limiting, and the scope of the present invention is not limited thereby.
1) Perform image preprocessing such as cropping and transformation on the input left and right viewpoint images for data enhancement.
The invention uses left- and right-view images captured by a binocular camera as network input and can output a monocular depth map in the left or right camera coordinate system. For convenience of description, the output monocular depth maps mentioned herein are all depth maps of the left image. The input requires RGB images of the left and right views, so the artificially synthesized SceneFlow data set and part of the real-environment KITTI2015 data set are adopted as training data. The large SceneFlow data set contains 39,000 binocular image pairs at 960 × 540 resolution with corresponding depth maps, and this volume of training data guarantees the learning capacity of the convolutional neural network. However, SceneFlow images are artificially synthesized and therefore differ somewhat from images captured in the real world. To enhance the model's performance in everyday scenes, this embodiment fine-tunes the model on the KITTI2015 data set, which contains 200 binocular image pairs with corresponding sparse depth maps. Because the method is unsupervised, the ground-truth depth data in the SceneFlow and KITTI2015 data sets are not used. The high resolution of the images in the data sets slows network training, so images are randomly cropped to 320 × 180 to improve training speed.
In addition, image preprocessing is performed on the data set images, including mild affine deformation, random horizontal flipping, random scale jitter, and random contrast, brightness, saturation, and sharpness adjustment, further increasing the number of samples, facilitating training optimization of the network parameters, and enhancing the generalization capability of the network.
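A minimal sketch of the photometric part of this preprocessing, assuming grayscale pixel values in [0, 1]; the jitter ranges are illustrative assumptions, not values fixed by the embodiment. Note that when a stereo pair is flipped horizontally, the left and right views must also be swapped to keep the epipolar geometry consistent:

```python
import random

def augment(image, rng):
    """Apply random photometric jitter to one view given as rows of
    grayscale pixel values in [0, 1]; results are clamped to [0, 1]."""
    brightness = rng.uniform(-0.2, 0.2)      # random brightness shift
    contrast = rng.uniform(0.8, 1.2)         # random contrast scale
    flip = rng.random() < 0.5                # random horizontal flip
    out = []
    for row in image:
        vals = [min(1.0, max(0.0, contrast * (p - 0.5) + 0.5 + brightness))
                for p in row]
        out.append(vals[::-1] if flip else vals)
    return out
```

The affine deformation, scale jitter, saturation, and sharpness adjustments would follow the same pattern with their own random parameters.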
2) Construct a multi-scale network model for binocular depth estimation; the model comprises multiple convolution layers, activation layers, residual connections, multi-scale pooling connections, linear up-sampling layers, and the like.
a) To reduce the number of network model parameters on a large scale so the model converges more easily and has stronger feature expression capability, the network uses 3 residual modules for feature extraction on the input image. Except for the first layer, the remaining convolution layers all use small 3 × 3 kernels to better retain edge information. Batch normalization is performed after each convolution layer to keep the data distribution stable, and a ReLU activation function follows each convolution layer in the model to prevent vanishing gradients during training. The output of each residual block is down-sampled; the multi-scale pooling module applies average pooling of different sizes to the residual-block inputs and reduces dimensionality through 1 × 1 convolution layers, so different feature information is perceived at different scales while the training parameters of the network are greatly reduced.
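The average pooling used by the multi-scale pooling module can be sketched as follows for a single-channel 2-D feature map (the list-of-lists layout is an assumption of this sketch):

```python
def avg_pool2d(x, kernel, stride):
    """Average pooling over a 2-D feature map (list of rows), as used by
    the multi-scale pooling module (kernel 4x4 / stride 4, or 2x2 / 2)."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h - kernel + 1, stride):
        row = []
        for j in range(0, w - kernel + 1, stride):
            window = [x[i + a][j + b]
                      for a in range(kernel) for b in range(kernel)]
            row.append(sum(window) / (kernel * kernel))
        out.append(row)
    return out
```

`avg_pool2d(x, 4, 4)` and `avg_pool2d(x, 2, 2)` correspond to the two configurations described above; the 1 × 1 convolution that follows only mixes channels and is omitted here.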
b) After the input image passes through the three residual modules and the multi-scale pooling module with its dimensionality reduction, a feature map at one-eighth of the original resolution is obtained. The left and right branches share network weights, and the correlation operation computes the feature correlation of the two maps according to the formula:
c(x1, x2) = ∑_{o ∈ [−k, k] × [−k, k]} ⟨f_l(x1 + o), f_r(x2 + o)⟩
In the formula, the feature block centered at x1 in the left map is correlated with every feature block in the right map, traversing to compute the matching features from one point in the left map to all points in the right map. The resulting matrix can be regarded as the matching costs of the feature blocks at different depths, and depth regression is then treated as a classification problem. In the depth regression, the softmax function
σ(C)_j = e^{C_j} / ∑_{k=1}^{K} e^{C_k},  j = 1, …, K
first converts the K matching costs at each pixel into a probability distribution over depth, and then a weighted summation gives a more stable depth estimate:
d̂ = ∑_{d=0}^{D_max} d · σ(C)_d
where d̂ denotes the predicted pixel depth, D_max is the maximum disparity to be estimated, d ranges over the depth values of the depth probability distribution, and C_d is the matching cost. The final output is the sum over all possible depths of the pixel, each weighted by the probability of that depth.
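The regression-as-classification step can be sketched as a soft argmax over per-depth matching scores (treating larger scores as better matches is an assumption of this sketch; the sign convention depends on whether correlations or costs are stored):

```python
import math

def soft_argmax_depth(scores):
    """Convert per-depth matching scores into a softmax probability
    distribution, then return the probability-weighted depth.
    scores[d] is the matching score for candidate depth d = 0 .. D_max."""
    m = max(scores)                          # stabilize the exponentials
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]            # softmax over depths
    return sum(d * p for d, p in enumerate(probs))
```

Unlike a hard argmax, this weighted sum is differentiable, which is what allows the depth regression to be trained end to end.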
c) Bilinear interpolation up-samples the matching cost at the small scale, the up-sampled cost is added to the next larger scale, and residual connections perform skip-layer connection of the information from multiple up-sampling layers. Residual learning during up-sampling makes full use of multi-scale information, so the network further refines the depth estimate on the basis of the previous scale's estimate while also becoming easier to train.
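The linear up-sampling step can be sketched as plain bilinear interpolation on a 2-D map (the align-to-origin coordinate mapping used here is one of several common conventions and is an assumption of this sketch):

```python
def bilinear_upsample(x, scale):
    """Bilinear up-sampling of a 2-D map (list of rows) by an integer
    scale factor, as used before the skip-layer residual addition."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(h * scale):
        fy = min(i / scale, h - 1)           # continuous input row coord
        y0 = int(fy); y1 = min(y0 + 1, h - 1); wy = fy - y0
        row = []
        for j in range(w * scale):
            fx = min(j / scale, w - 1)       # continuous input col coord
            x0 = int(fx); x1 = min(x0 + 1, w - 1); wx = fx - x0
            top = x[y0][x0] * (1 - wx) + x[y0][x1] * wx
            bot = x[y1][x0] * (1 - wx) + x[y1][x1] * wx
            row.append(top * (1 - wy) + bot * wy)
        out.append(row)
    return out
```

In the network, the up-sampled map would then be added element-wise to the next-larger-scale estimate as the residual skip connection.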
3) Set initialization parameters according to the designed network model, and design a loss function whose value is minimized during training to obtain the optimal network weights.
One key point of the invention is how to realize unsupervised learning: the network must construct a reasonable loss function to train, optimize, and adjust the training parameters. Assuming the prediction target is the depth map of the left image, the network predicts the left depth map D̂_l. To obtain the target image Ĩ_l, the input right image I_r in the image coordinate system is first transformed with the inverse intrinsic matrix K⁻¹ into the camera coordinate system of the right image; according to the stereo matching principle, the predicted left depth map D̂_l and the extrinsic matrix T perform the corresponding projection transformation to obtain an image in the left camera coordinate system; and the matrix K performs the coordinate transformation again to obtain a transition image Î_l in the left image coordinate system. From the binocular camera projection conversion formula:
p_r ~ K T D̂_l(p_l) K⁻¹ p_l
where p_r is the corresponding image pixel coordinate in the right view. Owing to the nature of projective transformation, the coordinates in the transition image Î_l are continuous values, so linear interpolation over the 4 neighboring pixels of each coordinate is used. The 4 neighboring pixels are the top-left, bottom-left, top-right, and bottom-right ones, and the interpolation formula is:
Ĩ_l(p_l) = ∑_{i ∈ {t, b}, j ∈ {l, r}} w_ij · I_r(p_r^ij)
where Ĩ_l(p_l) is the corresponding pixel value of the target image, each weight w_ij is inversely proportional to the spatial distance between the projected point and the neighboring pixel p_r^ij, and ∑_{i,j} w_ij = 1.
Therefore, the reconstruction loss function is given by:
L_rec = (1/N) ∑ huber(x)
where
huber(x) = 0.5 x²  if |x| ≤ c;  huber(x) = c (|x| − 0.5 c)  if |x| > c
In the formula, x represents the difference between corresponding pixels of the target image and the input image, N is the number of pixels in the image, and c is an empirically set threshold, set to 1 in this embodiment.
The Huber loss function is quadratic for residuals within the range c, giving smooth, well-scaled gradients for small residuals, and linear beyond c, bounding the influence of large residuals; it thereby effectively balances the two loss behaviors.
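The loss above can be sketched directly (the nested-list layout of the per-pixel images is an assumption of this sketch):

```python
def huber(x, c=1.0):
    """Huber penalty: quadratic for |x| <= c, linear beyond
    (c = 1 matches this embodiment)."""
    ax = abs(x)
    return 0.5 * x * x if ax <= c else c * (ax - 0.5 * c)

def reconstruction_loss(target, inp, c=1.0):
    """Mean Huber penalty over the per-pixel differences between the
    warped target image and the original input image."""
    diffs = [t - p for tr, pr in zip(target, inp) for t, p in zip(tr, pr)]
    return sum(huber(d, c) for d in diffs) / len(diffs)
```

For a residual of 0.5 the penalty is quadratic (0.125), while for a residual of 2.0 it grows only linearly (1.5), which is the balancing behavior described above.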
The network's input images carry no ground-truth depth information; instead, the original input image is estimated from the predicted depth map and the camera parameter matrices and serves as the network's label for optimizing the training parameters, thereby realizing unsupervised learning. Meanwhile, the camera parameters can be modified externally during training optimization, so the model suits multiple camera systems and is adaptive.
4) Input the image to be processed into the network model to obtain the corresponding depth map, and continuously repeat the above steps until the network converges or the set number of training iterations is reached.
In this embodiment, the synthetic large-scale SceneFlow data set is used for pre-training, and the KITTI2015 data set is then used for fine-tuning, so the network achieves high precision in everyday real scenes and the method has good universality.
The above description is only for the preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto, and any person skilled in the art can substitute or change the technical solution of the present invention and the inventive concept within the scope of the present invention, which is disclosed by the present invention, and the equivalent or change thereof belongs to the protection scope of the present invention.