CN113870335B - Monocular depth estimation method based on multi-scale feature fusion
- Publication number
- CN113870335B (application CN202111232322.5A)
- Authority
- CN
- China
- Prior art keywords
- features
- feature
- network
- map
- feature map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20016—Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a monocular depth estimation method based on multi-scale feature fusion, which belongs to the field of three-dimensional scene perception and comprises the following steps: S1: introduce a Non-Local attention mechanism and construct a mixed normalization function; S2: introduce an attention mechanism among the same-layer features, deep features and shallow features of the feature extraction network, and calculate an association information matrix between the features on the feature map; S3: construct a multi-scale feature fusion module; S4: introduce an atrous spatial pyramid pooling module into the decoding network, so that the receptive field of the convolutions is enlarged and the network is forced to learn more local detail information. The invention effectively realizes cross-space and cross-scale fusion of the hierarchical features of the feature extraction network, improves the network's ability to learn local detail, enables the depth map to be predicted at fine granularity during reconstruction, and keeps the number of introduced parameters relatively small compared with the whole network.
Description
Technical Field
The invention belongs to the field of three-dimensional scene perception, and relates to a monocular depth estimation method based on multi-scale feature fusion.
Background
Currently, mainstream monocular depth estimation methods are divided into unsupervised and supervised learning methods. Unsupervised methods do not require real depth labels: during training, a stereo image pair formed by an original image and a target image is used; an encoder first predicts the depth map of the original image, a decoder then reconstructs the original image by combining the target image with the predicted depth map, and the reconstructed image is compared with the original image to compute the loss. Supervised learning is currently one of the most popular approaches; depth labels are usually acquired with a depth camera or a lidar, and image depth estimation is treated as a regression or classification task. Most monocular depth estimation models lose spatial structure information in the encoding network because of insufficient feature extraction in the feature extraction stage, and since real scenes have complex structures, ordinary local convolution modules struggle to account for the spatial structural relationships between feature contexts, so the estimated depth maps suffer from blurred and distorted scales. To address this problem, the document "Chen et al., Structure-aware residual pyramid network for monocular depth estimation, International Joint Conference on Artificial Intelligence (IJCAI), 2019" proposes to construct a multi-scale feature fusion module with a residual pyramid network; by extracting features at different scales it obtains depth maps with a clearer structural hierarchy, but although this network design greatly improves the accuracy of image depth estimation, the network model is complex and computationally expensive. The document "Chen et al., Attention-based context aggregation network for monocular depth estimation, 2019" uses an attention-based aggregation network to capture continuous context information and integrate image-level and pixel-level context, but it does not capture context information or exchange spatial information between features across multiple scales.
In summary, the problems currently existing in the technical field of monocular depth estimation are: 1) In deep-learning-based image depth estimation, most networks adopt an encoder-decoder structure, and the encoding network suffers from insufficient feature extraction and loss of spatial information in the feature extraction stage, making it difficult for the network to learn some of the detailed information in the image. 2) The decoding network loses part of the image features during the repeated up-sampling of high-dimensional semantic features, so depth-map reconstruction is poor and fine-grained depth prediction is hindered. 3) The real scenes faced by monocular depth estimation are structurally complex; if the spatial structural relationships in the scene are not effectively considered, the accuracy of the estimated depth map is low.
Disclosure of Invention
In view of the above, the present invention aims to provide a monocular depth estimation method based on multi-scale feature fusion. To address the problems that the encoding network of monocular depth estimation extracts features insufficiently and easily loses spatial information in the feature extraction stage, making it difficult for the network to learn detailed information, a Non-Local module is introduced and improved, and a multi-scale feature fusion module based on an attention mechanism is built upon it. In the decoding network, the dilated convolutions of an atrous spatial pyramid pooling (ASPP) module compensate for the limited receptive field of ordinary local convolution modules, greatly alleviating problems such as the loss of image features caused by up-sampling during depth-map reconstruction, improving the accuracy of monocular depth estimation, and reducing problems such as blurred and distorted depth-map scales.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a monocular depth estimation method based on multi-scale feature fusion comprises the following steps:
S1: introducing a Non-Local attention mechanism, and constructing a mixed normalization function;
S2: introducing an attention mechanism among the same-layer features, deep features and shallow features of the feature extraction network, and calculating an association information matrix between the features on the feature map;
S3: constructing a multi-scale feature fusion module;
S4: introducing an atrous spatial pyramid pooling module into the decoding network, so that the receptive field of the convolutions is enlarged and the network is forced to learn more local detail information.
Further, the step S1 includes:
on the basis of Non-Local, a mixed SoftMax layer is constructed as the normalization function, which is calculated as follows:
where i is the current pixel on the feature map, j ranges over all the pixels on the feature map, π_n is the n-th aggregation weight, N is the number of feature-map partitions, w_n is a linear vector learned during network training, and the remaining symbols in the formula denote the similarity score of the n-th part and the arithmetic mean corresponding to each region k_j on the feature map X.
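The formula itself appears only as an image in the source. A plausible reconstruction, consistent with the variable definitions above and with the standard mixture-of-softmaxes normalization (the symbols w̃ⁿ_{i,j} for the similarity score and k̄ for the mean key are assumptions), is:

```latex
% Hedged sketch of the mixed-SoftMax normalization; pi_n and w_n follow the
% glossary above; \bar{k} (mean key) and \tilde{w}^n_{i,j} are assumed symbols.
F_{mos}(q_i, k_j)
  = \sum_{n=1}^{N} \pi_n \,
    \frac{\exp\!\big(\tilde{w}^{\,n}_{i,j}\big)}
         {\sum_{j'} \exp\!\big(\tilde{w}^{\,n}_{i,j'}\big)},
\qquad
\pi_n
  = \frac{\exp\!\big(w_n^{\top}\,\bar{k}\big)}
         {\sum_{n'=1}^{N} \exp\!\big(w_{n'}^{\top}\,\bar{k}\big)},
\qquad
\tilde{w}^{\,n}_{i,j} = q_{i,n}^{\top}\, k_{j,n}.
```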
Further, the step S2 specifically includes the following steps:
S21: the relationship modeling is performed on the current feature point q i by using other feature points k j on the feature map through self-conversion, and the calculation formula is as follows:
Where w i,j represents a spatial attention map, F mos (·) represents a normalization function, q i,n represents an index, k j,n represents a key, Representing the multiplication by element,Representing a self-converted feature map, v j representing values;
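The two expressions referenced in this step are rendered as images in the source. A sketch of their likely form, assuming the standard query-key-value attention layout and the symbols listed above, is:

```latex
% Hedged sketch of the self-conversion step (exact form assumed; symbols from the glossary)
w_{i,j} = F_{mos}\big(q_{i,n} \odot k_{j,n}\big),
\qquad
\tilde{X}_i = \sum_{j} w_{i,j}\, v_j .
```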
s22: by means of top-down feature conversion, high-dimensional semantic information is utilized Contextual information for low-dimensional featuresModeling is carried out, and the calculation formula is as follows:
wi,j=Fmos(Feud(qi,n,kj,n))
Wherein F eud (·) represents the Euclidean distance between two pixel points on the feature map;
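Only the attention-weight formula is given for this step; the aggregation that injects the high-dimensional context back into the low-dimensional features appears as an image in the source. A plausible form, mirroring the self-conversion step (the output symbol and the use of the high-dimensional values are assumptions), is:

```latex
% Hedged sketch: low-dimensional features refined with the high-dimensional values (assumed form)
\tilde{X}^{\,low}_i = \sum_{j} w_{i,j}\, v^{\,high}_j .
```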
S23: through feature conversion from bottom to top, relevant information modeling is carried out among feature map channels with different scales, and a specific calculation formula is as follows:
w=GAP(K)
Qatt=Fatt(Q,w)
Vdow=Fsconv(V)
Where w represents the channel attention map, GAP represents the global average pooling, K represents the keys of the network shallow feature map, Q att represents the channel attention weighted features, F att (-) represents the outer product function, Q represents the index of the network deep feature map, V dow represents the downsampled feature map, F sconv (-) is a 3X 3 convolution with step size, V represents the values of the network shallow feature map, Representing the feature map after bottom-up conversion, F conv (·) is a3×3 convolution for refinement, and F add (·) represents the element-wise addition of the two feature maps followed by a3×3 convolution process again.
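As a concrete illustration of step S23, the following PyTorch-style sketch strings the listed operations together. It is a minimal sketch under stated assumptions: the class and argument names are hypothetical, and the sigmoid on the pooled keys, the linear projection, and the assumption that the shallow map is exactly twice the resolution of the deep map are not taken from the source.

```python
import torch
import torch.nn as nn

class BottomUpConversion(nn.Module):
    """Hedged sketch of the bottom-up feature conversion in step S23.

    K and V come from a shallow (high-resolution) feature map, Q from a deep
    (low-resolution) feature map; channel counts and the exact fusion order
    are assumptions, not taken from the patent drawings.
    """

    def __init__(self, shallow_ch: int, deep_ch: int):
        super().__init__()
        # F_sconv: strided 3x3 convolution that down-samples V to Q's resolution
        self.f_sconv = nn.Conv2d(shallow_ch, deep_ch, kernel_size=3, stride=2, padding=1)
        # F_conv: 3x3 convolution used for refinement
        self.f_conv = nn.Conv2d(deep_ch, deep_ch, kernel_size=3, padding=1)
        # F_add: element-wise addition followed by another 3x3 convolution
        self.f_add_conv = nn.Conv2d(deep_ch, deep_ch, kernel_size=3, padding=1)
        # Projects the pooled shallow keys to the deep channel dimension (assumption)
        self.key_proj = nn.Linear(shallow_ch, deep_ch)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # w = GAP(K): channel attention map computed from the shallow keys
        w = k.mean(dim=(2, 3))                      # (B, shallow_ch)
        w = torch.sigmoid(self.key_proj(w))         # (B, deep_ch); sigmoid is an assumption
        # Q_att = F_att(Q, w): channel-wise (outer-product style) weighting of the deep queries
        q_att = q * w[:, :, None, None]
        # V_dow = F_sconv(V): strided 3x3 convolution down-samples the shallow values
        v_dow = self.f_sconv(v)
        # Refine, then fuse by element-wise addition followed by a 3x3 convolution
        return self.f_add_conv(self.f_conv(q_att) + v_dow)
```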
Further, in step S3, the three feature conversions of step S2 are applied to the middle four layers of features of the encoding network to obtain a set of enhanced high-level features; the enhanced features are then rearranged by scale, features of the same size are concatenated with the original features of the encoding network, and finally a 3×3 convolution restores the channel dimension of the enhanced features to the same dimension as the input, as sketched below.
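A minimal sketch of how such a fusion module could wire the converted features back into the encoder. Only the four feature levels, the same-size concatenation, and the channel-restoring 3×3 convolution come from the text; the class name, the per-level channel list, and the assumption that exactly three converted feature sets arrive per level are hypothetical.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureFusion(nn.Module):
    """Hedged sketch of step S3: fuse self / top-down / bottom-up converted
    features with the original encoder features at each of 4 levels."""

    def __init__(self, channels: list[int]):
        # channels: channel count of each of the 4 encoder levels (assumption)
        super().__init__()
        # 3x3 convolutions that restore the concatenated features to the input dimension
        self.restore = nn.ModuleList(
            nn.Conv2d(4 * c, c, kernel_size=3, padding=1) for c in channels
        )

    def forward(self, originals, self_feats, topdown_feats, bottomup_feats):
        # Each argument is a list of 4 feature maps, one per scale, already
        # rearranged so that index i holds features of the i-th level's size.
        fused = []
        for i, conv in enumerate(self.restore):
            # Cascade (concatenate) same-size enhanced features with the original encoder feature
            cat = torch.cat(
                [originals[i], self_feats[i], topdown_feats[i], bottomup_feats[i]], dim=1
            )
            fused.append(conv(cat))
        return fused
```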
Further, the step S4 specifically includes the following steps:
S41: a plurality of cavity space pyramid pooling modules are embedded between feature graphs with different resolutions in a crossing way, so that feature pyramid information with a large receptive field is captured, and the lost feature information in the up-sampling process is compensated;
S42: selecting a deconvolution module with a parameter capable of being learned based on an upsampling method; for all up-sampling modules, the output of the previous cavity space pyramid pooling module is deconvoluted to make the size of the feature map twice as large as the original one, and then the corresponding features output by the multi-scale feature fusion module and the depth map of the previous scale coarse estimation are cascaded;
S43: and for all the hole space pyramid pooling modules, respectively executing hole convolution with different hole rates on the input features, and outputting the input features as cascade connection of different hole convolution output features.
The invention has the following beneficial effects: 1) inspired by spatial attention and channel attention, the invention proposes a multi-scale feature fusion module based on an attention mechanism, which effectively realizes cross-space and cross-scale fusion of the hierarchical features of the feature extraction network; 2) the invention introduces the atrous spatial pyramid pooling module from the semantic segmentation field into the decoding network, which improves the network's ability to learn local details, enables fine-grained prediction of the depth map during reconstruction, and keeps the introduced parameters relatively small compared with the whole network. Simulation results show that the method outperforms the SARPN and ACAN algorithms.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below through preferred embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a monocular depth estimation algorithm based on multi-scale feature fusion of the present invention;
FIG. 2 is a flow chart of an attention mechanism construction of the present invention, (a) is based on a self-converting attention module, (b) is based on a top-down converting attention module, and (c) is based on a bottom-up converting attention module;
FIG. 3 is a flow chart of a multi-scale feature fusion module constructed based on an attention mechanism of the present invention;
FIG. 4 is a flow chart of the construction of the atrous spatial pyramid pooling module provided by the invention;
FIG. 5 shows an ablation experiment on the multi-scale feature fusion module, in which the 3rd row is the depth map predicted by the base network and the 4th row is the depth map predicted after the multi-scale feature fusion module improves the feature extraction capability of the encoding network;
FIG. 6 shows an ablation experiment on the atrous spatial pyramid pooling module, in which the 3rd column is the depth map predicted by the base network and the 4th column is the depth map predicted after the depth-map reconstruction capability of the decoding network is improved by the atrous spatial pyramid pooling module;
FIG. 7 is a schematic comparison between the depth maps predicted by the overall monocular depth estimation network incorporating the improvements of the present invention and by the ACAN network, in which the 2nd row is the depth map predicted by the ACAN network and the 3rd row is the depth map predicted by the improved network of the present invention.
Detailed Description
Other advantages and effects of the present invention will readily become apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the invention by way of specific examples. The invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various ways without departing from the spirit and scope of the invention. It should be noted that the illustrations provided in the following embodiments only illustrate the basic idea of the invention schematically, and the following embodiments and the features in the embodiments may be combined with one another provided there is no conflict.
The drawings are for illustrative purposes only, are schematic rather than physical views, and are not to be construed as limiting the invention; to better illustrate the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; and it will be appreciated by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments of the invention denote the same or similar components. In the description of the invention, terms such as "upper", "lower", "left", "right", "front" and "rear", if any, indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplification of description, do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation, and are therefore merely illustrative and not to be construed as limiting the invention. Their specific meaning can be understood by a person of ordinary skill in the art according to the specific circumstances.
Please refer to FIGS. 1-6, which illustrate the monocular depth estimation method based on multi-scale feature fusion.
FIG. 1 is a block diagram of the monocular depth estimation network based on multi-scale feature fusion according to an embodiment of the present invention. As shown in the figure, the monocular depth estimation algorithm based on multi-scale feature fusion according to an embodiment of the present invention includes:
In the feature extraction stage of the backbone network, attention mechanisms are introduced between features of different layers; the construction details of the attention mechanism are shown in FIG. 2. To make the standard SoftMax layer more effective on images, a hybrid SoftMax function is constructed; the normalization function is calculated as follows:
where w_n is a linear vector learned during network training, and the remaining symbols in the formula denote the arithmetic mean corresponding to each region k_j on the feature map X and the similarity score of the n-th part.
Three feature conversions are then constructed; their implementation details are shown in FIG. 2. First, through self-conversion, relationship modeling is performed on the current feature point q_i using the other feature points k_j on the feature map; the calculation formula is as follows:
where ⊙ denotes element-wise multiplication.
Secondly, through top-down feature conversion, the high-dimensional semantic information is used to model the context information of the low-dimensional features; the calculation formula is as follows:
w_{i,j} = F_mos(F_eud(q_{i,n}, k_{j,n}))
where F_eud(·) denotes the Euclidean distance between two pixel points on the feature map.
Finally, related information among feature-map channels of different scales is modeled through bottom-up feature conversion; the specific calculation formulas are as follows:
w = GAP(K)
Q_att = F_att(Q, w)
V_dow = F_sconv(V)
where F_att(·) denotes the outer-product function, F_sconv(·) is a strided 3×3 convolution, F_conv(·) is a 3×3 convolution used for refinement, and F_add(·) denotes element-wise addition of the two feature maps followed by another 3×3 convolution.
The construction of the multi-scale feature fusion module based on the attention mechanism is shown in FIG. 3. First, self-conversion, top-down conversion and bottom-up conversion are respectively performed on the four middle layers of features of the encoding network to obtain a set of enhanced high-level features; the enhanced features are then rearranged by scale, features of the same size are concatenated with the original features of the encoding network, and finally a 3×3 convolution restores the channel dimension of the enhanced features to the same dimension as the input. Compared with the input features, the features output by the multi-scale feature fusion module take the contextual relationships of the scene's spatial structure into account, which greatly improves the feature extraction capability of the feature extraction network.
Finally, because the image features of the encoding network are easily lost during depth-map reconstruction, an atrous spatial pyramid pooling module is introduced; the structure of the module is shown in FIG. 4. The specific steps are as follows:
1) Several atrous spatial pyramid pooling modules are embedded crosswise between feature maps of different resolutions, so that feature-pyramid information with a large receptive field is captured and the feature information lost during up-sampling is compensated;
2) A deconvolution module with learnable parameters is selected as the up-sampling method. In every up-sampling module, the output of the preceding atrous spatial pyramid pooling module is deconvolved so that the feature map doubles in size, and it is then concatenated with the corresponding features output by the multi-scale feature fusion module and with the coarsely estimated depth map of the previous scale;
3) Each atrous spatial pyramid pooling module applies dilated convolutions with different dilation rates to its input features, and the output of the module is the concatenation of the different dilated-convolution outputs.
The proposed monocular depth estimation method based on multi-scale feature fusion was evaluated experimentally on the NYU-Depth v2 and KITTI data sets. The NYU-Depth v2 data set was captured in indoor scenes with a Microsoft Kinect RGB-D camera and consists of depth maps and RGB images; the KITTI data set was collected with a lidar sensor and a vehicle-mounted camera in a variety of road environments. These two data sets are the standard data sets for evaluating monocular depth estimation in indoor and outdoor scenarios.
FIG. 5 compares results before and after introducing the multi-scale feature fusion module: row 1 is the input RGB image, row 2 the ground-truth depth map, row 3 the depth map predicted by the base network of the invention, and row 4 the depth map predicted after the multi-scale feature fusion module is introduced. As can be seen from the figure, the scale of the depth map predicted by the base network is very blurry, and the depth prediction at object edges is unclear. After the feature extraction capability of the encoding network is improved by the multi-scale feature fusion module, the depth-map prediction becomes more accurate and the edge contours of objects become clearer.
FIG. 6 compares results before and after introducing the atrous spatial pyramid pooling module into the decoding network: column 1 is the input RGB image, column 2 the ground-truth depth map, column 3 the depth map predicted by the base network of the invention, and column 4 the depth map predicted after the atrous spatial pyramid pooling module is introduced. As can be seen from the figure, after the decoding network is improved, the network recovers the local details of objects more accurately. Observing the bookshelf in the white box of row 2, its ground-truth depth map is missing, and the improved network completes the missing depth. Therefore, introducing the atrous spatial pyramid pooling module effectively addresses the loss of image detail information and achieves fine-grained prediction.
FIG. 7 compares the depth maps predicted by the overall monocular depth estimation network improved according to the present invention and by the ACAN monocular depth estimation network on the KITTI data set: row 1 is the input RGB image, row 2 the depth map estimated by the ACAN network, and row 3 the depth map estimated by the improved network of the present invention. As can be seen from the figure, the ACAN network predicts nearby targets unclearly and often fails to predict the depth of distant targets, whereas the depth map predicted by the improved network preserves clear contours and detailed information for the various targets, greatly improving the accuracy of monocular depth estimation.
Table 1 reports the average relative error, root-mean-square error, logarithmic average error and threshold accuracy of the method of the present invention and of other algorithms on the NYU-Depth v2 data set. The data in Table 1 show that the method of the invention obtains good results on most indices and improves the accuracy of depth-map estimation to a certain extent. Compared with SARPN, the method improves the threshold accuracy by 1.2% and reduces the errors of the other indices to varying degrees. Compared with ACAN, it reduces the average relative error by 16% and improves the threshold accuracy by 5.3%. The advantage of the multi-scale feature fusion module based on the attention mechanism is thus evident.
TABLE 1
The monocular depth estimation algorithm based on multi-scale feature fusion effectively addresses problems such as insufficient feature extraction and missing spatial information in the feature extraction network, and improves the accuracy of the depth map predicted by the network. For the first time, attention mechanisms are applied between the hierarchical features of the backbone network, focusing on the spatial structural relationships between features. An atrous spatial pyramid pooling module is introduced into the decoding network, which alleviates the loss of image features during depth-map reconstruction and enlarges the receptive field of the local convolutions. Simulation results show that the proposed multi-scale feature fusion monocular depth estimation algorithm performs well in terms of accuracy and the reconstruction of object edge contours.
Finally, it is noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made without departing from the spirit and scope of the technical solution, all of which are intended to be covered by the claims of the present invention.
Claims (2)
1. A monocular depth estimation method based on multi-scale feature fusion, characterized by comprising the following steps:
S1: introducing a Non-Local attention mechanism, and constructing a mixed normalization function; the step S1 includes:
on the basis of Non-Local, a mixed SoftMax layer is constructed as the normalization function, which is calculated as follows:
where i is the current pixel on the feature map, j ranges over all the pixels on the feature map, π_n is the n-th aggregation weight, N is the number of feature-map partitions, w_n is a linear vector learned during network training, and the remaining symbols in the formula denote the similarity score of the n-th part and the arithmetic mean corresponding to each region k_j on the feature map X;
S2: introducing an attention mechanism among the same-layer features, deep features and shallow features of the feature extraction network, and calculating an association information matrix between the features on the feature map; the step S2 specifically includes the following steps:
S21: through self-conversion, relationship modeling is performed on the current feature point q_i using the other feature points k_j on the feature map; the calculation formula is as follows:
where w_{i,j} denotes the spatial attention map, F_mos(·) the normalization function, q_{i,n} the query, k_{j,n} the key, ⊙ element-wise multiplication, X̃ the self-converted feature map, and v_j the values;
s22: by means of top-down feature conversion, high-dimensional semantic information is utilized Contextual information for low-dimensional featuresModeling is carried out, and the calculation formula is as follows:
wi,j=Fmos(Feud(qi,n,kj,n))
Wherein F eud (·) represents the Euclidean distance between two pixel points on the feature map;
S23: through feature conversion from bottom to top, relevant information modeling is carried out among feature map channels with different scales, and a specific calculation formula is as follows:
w=GAP(K)
Qatt=Fatt(Q,w)
Vdow=Fsconv(V)
Where w represents the channel attention map, GAP represents the global average pooling, K represents the keys of the network shallow feature map, Q att represents the channel attention weighted features, F att (-) represents the outer product function, Q represents the index of the network deep feature map, V dow represents the downsampled feature map, F sconv (-) is a 3X 3 convolution with step size, V represents the values of the network shallow feature map, Representing the feature map after bottom-up conversion, F conv (·) is a3×3 convolution for refinement, and F add (·) represents that the two feature maps are subjected to element-by-element addition and then subjected to 3×3 convolution again;
S3: constructing a multi-scale feature fusion module; in the step S3, three kinds of feature conversion in the step S2 are respectively performed on the middle 4-layer features of the coding network to obtain a plurality of enhanced advanced features, then the enhanced features are subjected to feature rearrangement according to the scale, the features with the same size are cascaded with the original features on the coding network, and finally the channel dimension of the enhanced features is restored to the same dimension in input through a 3×3 convolution;
S4: and a hole space pyramid pooling module is introduced into the decoding network, so that the receptive field of convolution is enlarged, and the network learns more local detail information.
2. The monocular depth estimation method based on multi-scale feature fusion of claim 1, wherein: the step S4 specifically includes the following steps:
S41: a plurality of cavity space pyramid pooling modules are embedded between feature graphs with different resolutions in a crossing way, and feature pyramid information with a large receptive field is captured;
S42: selecting a deconvolution module with a parameter capable of being learned based on an upsampling method; for all up-sampling modules, the output of the previous cavity space pyramid pooling module is deconvoluted to make the size of the feature map twice as large as the original one, and then the corresponding features output by the multi-scale feature fusion module and the depth map of the previous scale coarse estimation are cascaded;
S43: and for all the hole space pyramid pooling modules, respectively executing hole convolution with different hole rates on the input features, and outputting the input features as cascade connection of different hole convolution output features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111232322.5A CN113870335B (en) | 2021-10-22 | 2021-10-22 | Monocular depth estimation method based on multi-scale feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111232322.5A CN113870335B (en) | 2021-10-22 | 2021-10-22 | Monocular depth estimation method based on multi-scale feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113870335A CN113870335A (en) | 2021-12-31 |
CN113870335B true CN113870335B (en) | 2024-07-30 |
Family
ID=78997259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111232322.5A (CN113870335B, Active) | Monocular depth estimation method based on multi-scale feature fusion | 2021-10-22 | 2021-10-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113870335B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114565655B (en) * | 2022-02-28 | 2024-02-02 | 上海应用技术大学 | Depth estimation method and device based on pyramid segmentation attention |
CN114693823B (en) * | 2022-03-09 | 2024-06-04 | 天津大学 | Magnetic resonance image reconstruction method based on space-frequency double-domain parallel reconstruction |
CN115359271B (en) * | 2022-08-15 | 2023-04-18 | 中国科学院国家空间科学中心 | Large-scale invariance deep space small celestial body image matching method |
CN115115686B (en) * | 2022-08-22 | 2023-07-18 | 中国矿业大学 | Mine image unsupervised monocular depth estimation method based on fine granularity multi-feature fusion |
CN115580564B (en) * | 2022-11-09 | 2023-04-18 | 深圳桥通物联科技有限公司 | Dynamic calling device for communication gateway of Internet of things |
CN116342675B (en) * | 2023-05-29 | 2023-08-11 | 南昌航空大学 | Real-time monocular depth estimation method, system, electronic equipment and storage medium |
CN116823908B (en) * | 2023-06-26 | 2024-09-03 | 北京邮电大学 | Monocular image depth estimation method based on multi-scale feature correlation enhancement |
CN117078236B (en) * | 2023-10-18 | 2024-02-02 | 广东工业大学 | Intelligent maintenance method and device for complex equipment, electronic equipment and storage medium |
CN118212637B (en) * | 2024-05-17 | 2024-09-03 | 山东浪潮科学研究院有限公司 | Automatic image quality assessment method and system for character recognition |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2747028B1 (en) * | 2012-12-18 | 2015-08-19 | Universitat Pompeu Fabra | Method for recovering a relative depth map from a single image or a sequence of still images |
CN108510535B (en) * | 2018-03-14 | 2020-04-24 | 大连理工大学 | High-quality depth estimation method based on depth prediction and enhancer network |
CN112785636B (en) * | 2021-02-18 | 2023-04-28 | 上海理工大学 | Multi-scale enhanced monocular depth estimation method |
Non-Patent Citations (1)
Title |
---|
- Research on monocular depth estimation based on multi-scale feature fusion (基于多尺度特征融合的单目深度估计研究); Deng Chaolong (邓朝龙); Wanfang Data (万方数据); 2023-07-06; full text *
Also Published As
Publication number | Publication date |
---|---|
CN113870335A (en) | 2021-12-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |