CN113870335B - Monocular depth estimation method based on multi-scale feature fusion
- Publication number
- CN113870335B (application CN202111232322.5A)
- Authority
- CN
- China
- Prior art keywords
- features
- feature
- network
- map
- feature map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20016—Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a monocular depth estimation method based on multi-scale feature fusion, which belongs to the field of three-dimensional scene perception and comprises the following steps: S1: introduce a Non-Local attention mechanism and construct a mixed normalization function; S2: introduce an attention mechanism among the same-layer features, deep features and shallow features of the feature extraction network, and calculate an association information matrix between the features on the feature map; S3: construct a multi-scale feature fusion module; S4: introduce an atrous spatial pyramid pooling module into the decoding network, so that the receptive field of the convolutions is enlarged and the network is forced to learn more local detail information. The invention effectively realizes cross-space and cross-scale fusion of the hierarchical features of the feature extraction network, improves the network's ability to learn local detail, enables the depth map to be predicted at fine granularity during reconstruction, and keeps the number of introduced parameters relatively small compared with the whole network.
Description
Technical Field
The invention belongs to the field of three-dimensional scene perception, and relates to a monocular depth estimation method based on multi-scale feature fusion.
Background
Currently, mainstream monocular depth estimation methods are divided into unsupervised and supervised learning methods. Unsupervised methods do not require real depth labels: during training, a stereo image pair formed by an original image and a target image is used; an encoder first predicts the depth map of the original image, a decoder then reconstructs the original image by combining the target image with the predicted depth map, and the reconstructed image is compared with the original image to compute the loss. Supervised learning is currently one of the most popular approaches; depth labels are usually acquired with a depth camera or a lidar, and image depth estimation is treated as a regression or classification task. Most monocular depth estimation models lose spatial structure information in the encoding network because of insufficient feature extraction in the feature extraction stage, and since real scenes have complex structures, ordinary local convolution modules struggle to account for the spatial structural relationships between feature contexts, so the estimated depth maps suffer from blurred and distorted scales. To address this problem, the document "Chen et al., Structure-aware residual pyramid network for monocular depth estimation, International Joint Conference on Artificial Intelligence (IJCAI), 2019" proposes to construct a multi-scale feature fusion module with a residual pyramid network; by extracting features at different scales it obtains depth maps with a clearer structural hierarchy, but although this network design greatly improves the accuracy of image depth estimation, the network model is complex and computationally expensive. The document "Chen et al., Attention-based context aggregation network for monocular depth estimation, 2019" uses an attention-based aggregation network to capture continuous context information and integrate image-level and pixel-level context, but it does not capture context information or exchange spatial information between features across multiple scales.
In summary, the problems currently existing in the technical field of monocular depth estimation are: 1) In deep-learning-based image depth estimation, most networks adopt an encoder-decoder structure, and the encoding network suffers from insufficient feature extraction and loss of spatial information in the feature extraction stage, making it difficult for the network to learn some of the detailed information in the image. 2) The decoding network loses part of the image features during the repeated up-sampling of high-dimensional semantic features, so depth-map reconstruction is poor and fine-grained depth prediction is hindered. 3) The real scenes faced by monocular depth estimation are structurally complex; if the spatial structural relationships in the scene are not effectively considered, the accuracy of the estimated depth map is low.
Disclosure of Invention
In view of the above, the present invention aims to provide a monocular depth estimation method based on multi-scale feature fusion. To address the problems that the encoding network of monocular depth estimation extracts features insufficiently and easily loses spatial information in the feature extraction stage, making it difficult for the network to learn detailed information, a Non-Local module is introduced and improved, and a multi-scale feature fusion module based on an attention mechanism is built upon it. In the decoding network, the dilated convolutions of an atrous spatial pyramid pooling (ASPP) module compensate for the limited receptive field of ordinary local convolution modules, greatly alleviating problems such as the loss of image features caused by up-sampling during depth-map reconstruction, improving the accuracy of monocular depth estimation, and reducing problems such as blurred and distorted depth-map scales.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a monocular depth estimation method based on multi-scale feature fusion comprises the following steps:
S1: introducing a Non-Local attention mechanism, and constructing a mixed normalization function;
S2: introducing an attention mechanism among the same-layer features, deep features and shallow features of the feature extraction network, and calculating an association information matrix between the features on the feature map;
S3: constructing a multi-scale feature fusion module;
S4: introducing an atrous spatial pyramid pooling module into the decoding network, so that the receptive field of the convolutions is enlarged and the network is forced to learn more local detail information.
Further, the step S1 includes:
on the basis of Non-Local, a mixed SoftMax layer is constructed as the normalization function, which is calculated as follows:
where i is the current pixel on the feature map, j ranges over all the pixels on the feature map, π_n is the n-th aggregation weight, N is the number of feature-map partitions, w_n is a linear vector learned during network training, and the remaining symbols in the formula denote the similarity score of the n-th part and the arithmetic mean corresponding to each region k_j on the feature map X.
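The formula itself appears only as an image in the source. A plausible reconstruction, consistent with the variable definitions above and with the standard mixture-of-softmaxes normalization (the symbols w̃ⁿ_{i,j} for the similarity score and k̄ for the mean key are assumptions), is:

```latex
% Hedged sketch of the mixed-SoftMax normalization; pi_n and w_n follow the
% glossary above; \bar{k} (mean key) and \tilde{w}^n_{i,j} are assumed symbols.
F_{mos}(q_i, k_j)
  = \sum_{n=1}^{N} \pi_n \,
    \frac{\exp\!\big(\tilde{w}^{\,n}_{i,j}\big)}
         {\sum_{j'} \exp\!\big(\tilde{w}^{\,n}_{i,j'}\big)},
\qquad
\pi_n
  = \frac{\exp\!\big(w_n^{\top}\,\bar{k}\big)}
         {\sum_{n'=1}^{N} \exp\!\big(w_{n'}^{\top}\,\bar{k}\big)},
\qquad
\tilde{w}^{\,n}_{i,j} = q_{i,n}^{\top}\, k_{j,n}.
```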
Further, the step S2 specifically includes the following steps:
S21: the relationship modeling is performed on the current feature point q i by using other feature points k j on the feature map through self-conversion, and the calculation formula is as follows:
Where w i,j represents a spatial attention map, F mos (·) represents a normalization function, q i,n represents an index, k j,n represents a key, Representing the multiplication by element,Representing a self-converted feature map, v j representing values;
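The two expressions referenced in this step are rendered as images in the source. A sketch of their likely form, assuming the standard query-key-value attention layout and the symbols listed above, is:

```latex
% Hedged sketch of the self-conversion step (exact form assumed; symbols from the glossary)
w_{i,j} = F_{mos}\big(q_{i,n} \odot k_{j,n}\big),
\qquad
\tilde{X}_i = \sum_{j} w_{i,j}\, v_j .
```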
s22: by means of top-down feature conversion, high-dimensional semantic information is utilized Contextual information for low-dimensional featuresModeling is carried out, and the calculation formula is as follows:
wi,j=Fmos(Feud(qi,n,kj,n))
Wherein F eud (·) represents the Euclidean distance between two pixel points on the feature map;
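Only the attention-weight formula is given for this step; the aggregation that injects the high-dimensional context back into the low-dimensional features appears as an image in the source. A plausible form, mirroring the self-conversion step (the output symbol and the use of the high-dimensional values are assumptions), is:

```latex
% Hedged sketch: low-dimensional features refined with the high-dimensional values (assumed form)
\tilde{X}^{\,low}_i = \sum_{j} w_{i,j}\, v^{\,high}_j .
```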
S23: through feature conversion from bottom to top, relevant information modeling is carried out among feature map channels with different scales, and a specific calculation formula is as follows:
w=GAP(K)
Qatt=Fatt(Q,w)
Vdow=Fsconv(V)
Where w represents the channel attention map, GAP represents the global average pooling, K represents the keys of the network shallow feature map, Q att represents the channel attention weighted features, F att (-) represents the outer product function, Q represents the index of the network deep feature map, V dow represents the downsampled feature map, F sconv (-) is a 3X 3 convolution with step size, V represents the values of the network shallow feature map, Representing the feature map after bottom-up conversion, F conv (·) is a3×3 convolution for refinement, and F add (·) represents the element-wise addition of the two feature maps followed by a3×3 convolution process again.
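As a concrete illustration of step S23, the following PyTorch-style sketch strings the listed operations together. It is a minimal sketch under stated assumptions: the class and argument names are hypothetical, and the sigmoid on the pooled keys, the linear projection, and the assumption that the shallow map is exactly twice the resolution of the deep map are not taken from the source.

```python
import torch
import torch.nn as nn

class BottomUpConversion(nn.Module):
    """Hedged sketch of the bottom-up feature conversion in step S23.

    K and V come from a shallow (high-resolution) feature map, Q from a deep
    (low-resolution) feature map; channel counts and the exact fusion order
    are assumptions, not taken from the patent drawings.
    """

    def __init__(self, shallow_ch: int, deep_ch: int):
        super().__init__()
        # F_sconv: strided 3x3 convolution that down-samples V to Q's resolution
        self.f_sconv = nn.Conv2d(shallow_ch, deep_ch, kernel_size=3, stride=2, padding=1)
        # F_conv: 3x3 convolution used for refinement
        self.f_conv = nn.Conv2d(deep_ch, deep_ch, kernel_size=3, padding=1)
        # F_add: element-wise addition followed by another 3x3 convolution
        self.f_add_conv = nn.Conv2d(deep_ch, deep_ch, kernel_size=3, padding=1)
        # Projects the pooled shallow keys to the deep channel dimension (assumption)
        self.key_proj = nn.Linear(shallow_ch, deep_ch)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # w = GAP(K): channel attention map computed from the shallow keys
        w = k.mean(dim=(2, 3))                      # (B, shallow_ch)
        w = torch.sigmoid(self.key_proj(w))         # (B, deep_ch); sigmoid is an assumption
        # Q_att = F_att(Q, w): channel-wise (outer-product style) weighting of the deep queries
        q_att = q * w[:, :, None, None]
        # V_dow = F_sconv(V): strided 3x3 convolution down-samples the shallow values
        v_dow = self.f_sconv(v)
        # Refine, then fuse by element-wise addition followed by a 3x3 convolution
        return self.f_add_conv(self.f_conv(q_att) + v_dow)
```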
Further, in step S3, the three feature conversions of step S2 are applied to the middle four layers of features of the encoding network to obtain a set of enhanced high-level features; the enhanced features are then rearranged by scale, features of the same size are concatenated with the original features of the encoding network, and finally a 3×3 convolution restores the channel dimension of the enhanced features to the same dimension as the input, as sketched below.
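A minimal sketch of how such a fusion module could wire the converted features back into the encoder. Only the four feature levels, the same-size concatenation, and the channel-restoring 3×3 convolution come from the text; the class name, the per-level channel list, and the assumption that exactly three converted feature sets arrive per level are hypothetical.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureFusion(nn.Module):
    """Hedged sketch of step S3: fuse self / top-down / bottom-up converted
    features with the original encoder features at each of 4 levels."""

    def __init__(self, channels: list[int]):
        # channels: channel count of each of the 4 encoder levels (assumption)
        super().__init__()
        # 3x3 convolutions that restore the concatenated features to the input dimension
        self.restore = nn.ModuleList(
            nn.Conv2d(4 * c, c, kernel_size=3, padding=1) for c in channels
        )

    def forward(self, originals, self_feats, topdown_feats, bottomup_feats):
        # Each argument is a list of 4 feature maps, one per scale, already
        # rearranged so that index i holds features of the i-th level's size.
        fused = []
        for i, conv in enumerate(self.restore):
            # Cascade (concatenate) same-size enhanced features with the original encoder feature
            cat = torch.cat(
                [originals[i], self_feats[i], topdown_feats[i], bottomup_feats[i]], dim=1
            )
            fused.append(conv(cat))
        return fused
```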
Further, the step S4 specifically includes the following steps:
S41: a plurality of cavity space pyramid pooling modules are embedded between feature graphs with different resolutions in a crossing way, so that feature pyramid information with a large receptive field is captured, and the lost feature information in the up-sampling process is compensated;
S42: selecting a deconvolution module with a parameter capable of being learned based on an upsampling method; for all up-sampling modules, the output of the previous cavity space pyramid pooling module is deconvoluted to make the size of the feature map twice as large as the original one, and then the corresponding features output by the multi-scale feature fusion module and the depth map of the previous scale coarse estimation are cascaded;
S43: and for all the hole space pyramid pooling modules, respectively executing hole convolution with different hole rates on the input features, and outputting the input features as cascade connection of different hole convolution output features.
The invention has the following beneficial effects: 1) inspired by spatial attention and channel attention, the invention proposes a multi-scale feature fusion module based on an attention mechanism, which effectively realizes cross-space and cross-scale fusion of the hierarchical features of the feature extraction network; 2) the invention introduces the atrous spatial pyramid pooling module from the semantic segmentation field into the decoding network, which improves the network's ability to learn local details, enables fine-grained prediction of the depth map during reconstruction, and keeps the introduced parameters relatively small compared with the whole network. Simulation results show that the method outperforms the SARPN and ACAN algorithms.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below through preferred embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a monocular depth estimation algorithm based on multi-scale feature fusion of the present invention;
FIG. 2 is a flow chart of an attention mechanism construction of the present invention, (a) is based on a self-converting attention module, (b) is based on a top-down converting attention module, and (c) is based on a bottom-up converting attention module;
FIG. 3 is a flow chart of a multi-scale feature fusion module constructed based on an attention mechanism of the present invention;
FIG. 4 is a flow chart of the construction of the atrous spatial pyramid pooling module provided by the invention;
FIG. 5 shows an ablation experiment on the multi-scale feature fusion module, in which the 3rd row is the depth map predicted by the base network and the 4th row is the depth map predicted after the multi-scale feature fusion module improves the feature extraction capability of the encoding network;
FIG. 6 shows an ablation experiment on the atrous spatial pyramid pooling module, in which the 3rd column is the depth map predicted by the base network and the 4th column is the depth map predicted after the depth-map reconstruction capability of the decoding network is improved by the atrous spatial pyramid pooling module;
FIG. 7 is a schematic comparison between the depth maps predicted by the overall monocular depth estimation network incorporating the improvements of the present invention and by the ACAN network, in which the 2nd row is the depth map predicted by the ACAN network and the 3rd row is the depth map predicted by the improved network of the present invention.
Detailed Description
Other advantages and effects of the present invention will readily become apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the invention by way of specific examples. The invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various ways without departing from the spirit and scope of the invention. It should be noted that the illustrations provided in the following embodiments only illustrate the basic idea of the invention schematically, and the following embodiments and the features in the embodiments may be combined with one another provided there is no conflict.
The drawings are for illustrative purposes only, are schematic rather than physical views, and are not to be construed as limiting the invention; to better illustrate the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; and it will be appreciated by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments of the invention denote the same or similar components. In the description of the invention, terms such as "upper", "lower", "left", "right", "front" and "rear", if any, indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplification of description, do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation, and are therefore merely illustrative and not to be construed as limiting the invention. Their specific meaning can be understood by a person of ordinary skill in the art according to the specific circumstances.
Please refer to FIGS. 1-6, which illustrate the monocular depth estimation method based on multi-scale feature fusion.
FIG. 1 is a block diagram of the monocular depth estimation network based on multi-scale feature fusion according to an embodiment of the present invention. As shown in the figure, the monocular depth estimation algorithm based on multi-scale feature fusion according to an embodiment of the present invention includes:
In the feature extraction stage of the backbone network, attention mechanisms are introduced between features of different layers; the construction details of the attention mechanism are shown in FIG. 2. To make the standard SoftMax layer more effective on images, a hybrid SoftMax function is constructed; the normalization function is calculated as follows:
where w_n is a linear vector learned during network training, and the remaining symbols in the formula denote the arithmetic mean corresponding to each region k_j on the feature map X and the similarity score of the n-th part.
Three feature conversions are then constructed; their implementation details are shown in FIG. 2. First, through self-conversion, relationship modeling is performed on the current feature point q_i using the other feature points k_j on the feature map; the calculation formula is as follows:
where ⊙ denotes element-wise multiplication.
Secondly, through top-down feature conversion, the high-dimensional semantic information is used to model the context information of the low-dimensional features; the calculation formula is as follows:
w_{i,j} = F_mos(F_eud(q_{i,n}, k_{j,n}))
where F_eud(·) denotes the Euclidean distance between two pixel points on the feature map.
Finally, related information among feature-map channels of different scales is modeled through bottom-up feature conversion; the specific calculation formulas are as follows:
w = GAP(K)
Q_att = F_att(Q, w)
V_dow = F_sconv(V)
where F_att(·) denotes the outer-product function, F_sconv(·) is a strided 3×3 convolution, F_conv(·) is a 3×3 convolution used for refinement, and F_add(·) denotes element-wise addition of the two feature maps followed by another 3×3 convolution.
The construction of the multi-scale feature fusion module based on the attention mechanism is shown in FIG. 3. First, self-conversion, top-down conversion and bottom-up conversion are respectively performed on the four middle layers of features of the encoding network to obtain a set of enhanced high-level features; the enhanced features are then rearranged by scale, features of the same size are concatenated with the original features of the encoding network, and finally a 3×3 convolution restores the channel dimension of the enhanced features to the same dimension as the input. Compared with the input features, the features output by the multi-scale feature fusion module take the contextual relationships of the scene's spatial structure into account, which greatly improves the feature extraction capability of the feature extraction network.
Finally, because the image features of the encoding network are easily lost during depth-map reconstruction, an atrous spatial pyramid pooling module is introduced; the structure of the module is shown in FIG. 4. The specific steps are as follows:
1) Several atrous spatial pyramid pooling modules are embedded crosswise between feature maps of different resolutions, so that feature-pyramid information with a large receptive field is captured and the feature information lost during up-sampling is compensated;
2) A deconvolution module with learnable parameters is selected as the up-sampling method. In every up-sampling module, the output of the preceding atrous spatial pyramid pooling module is deconvolved so that the feature map doubles in size, and it is then concatenated with the corresponding features output by the multi-scale feature fusion module and with the coarsely estimated depth map of the previous scale;
3) Each atrous spatial pyramid pooling module applies dilated convolutions with different dilation rates to its input features, and the output of the module is the concatenation of the different dilated-convolution outputs.
The proposed monocular depth estimation method based on multi-scale feature fusion was evaluated experimentally on the NYU-Depth v2 and KITTI data sets. The NYU-Depth v2 data set was captured in indoor scenes with a Microsoft Kinect RGB-D camera and consists of depth maps and RGB images; the KITTI data set was collected with a lidar sensor and a vehicle-mounted camera in a variety of road environments. These two data sets are the standard data sets for evaluating monocular depth estimation in indoor and outdoor scenarios.
FIG. 5 compares results before and after introducing the multi-scale feature fusion module: row 1 is the input RGB image, row 2 the ground-truth depth map, row 3 the depth map predicted by the base network of the invention, and row 4 the depth map predicted after the multi-scale feature fusion module is introduced. As can be seen from the figure, the scale of the depth map predicted by the base network is very blurry, and the depth prediction at object edges is unclear. After the feature extraction capability of the encoding network is improved by the multi-scale feature fusion module, the depth-map prediction becomes more accurate and the edge contours of objects become clearer.
FIG. 6 compares results before and after introducing the atrous spatial pyramid pooling module into the decoding network: column 1 is the input RGB image, column 2 the ground-truth depth map, column 3 the depth map predicted by the base network of the invention, and column 4 the depth map predicted after the atrous spatial pyramid pooling module is introduced. As can be seen from the figure, after the decoding network is improved, the network recovers the local details of objects more accurately. Observing the bookshelf in the white box of row 2, its ground-truth depth map is missing, and the improved network completes the missing depth. Therefore, introducing the atrous spatial pyramid pooling module effectively addresses the loss of image detail information and achieves fine-grained prediction.
FIG. 7 compares the depth maps predicted by the overall monocular depth estimation network improved according to the present invention and by the ACAN monocular depth estimation network on the KITTI data set: row 1 is the input RGB image, row 2 the depth map estimated by the ACAN network, and row 3 the depth map estimated by the improved network of the present invention. As can be seen from the figure, the ACAN network predicts nearby targets unclearly and often fails to predict the depth of distant targets, whereas the depth map predicted by the improved network preserves clear contours and detailed information for the various targets, greatly improving the accuracy of monocular depth estimation.
Table 1 reports the average relative error, root-mean-square error, logarithmic average error and threshold accuracy of the method of the present invention and of other algorithms on the NYU-Depth v2 data set. The data in Table 1 show that the method of the invention obtains good results on most indices and improves the accuracy of depth-map estimation to a certain extent. Compared with SARPN, the method improves the threshold accuracy by 1.2% and reduces the errors of the other indices to varying degrees. Compared with ACAN, it reduces the average relative error by 16% and improves the threshold accuracy by 5.3%. The advantage of the multi-scale feature fusion module based on the attention mechanism is thus evident.
TABLE 1
The monocular depth estimation algorithm based on multi-scale feature fusion effectively addresses problems such as insufficient feature extraction and missing spatial information in the feature extraction network, and improves the accuracy of the depth map predicted by the network. For the first time, attention mechanisms are applied between the hierarchical features of the backbone network, focusing on the spatial structural relationships between features. An atrous spatial pyramid pooling module is introduced into the decoding network, which alleviates the loss of image features during depth-map reconstruction and enlarges the receptive field of the local convolutions. Simulation results show that the proposed multi-scale feature fusion monocular depth estimation algorithm performs well in terms of accuracy and the reconstruction of object edge contours.
Finally, it is noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made without departing from the spirit and scope of the technical solution, all of which are intended to be covered by the claims of the present invention.
Claims (2)
1. A monocular depth estimation method based on multi-scale feature fusion, characterized by comprising the following steps:
S1: introducing a Non-Local attention mechanism, and constructing a mixed normalization function; the step S1 includes:
on the basis of Non-Local, a mixed SoftMax layer is constructed as the normalization function, which is calculated as follows:
where i is the current pixel on the feature map, j ranges over all the pixels on the feature map, π_n is the n-th aggregation weight, N is the number of feature-map partitions, w_n is a linear vector learned during network training, and the remaining symbols in the formula denote the similarity score of the n-th part and the arithmetic mean corresponding to each region k_j on the feature map X;
S2: introducing an attention mechanism among the same-layer features, deep features and shallow features of the feature extraction network, and calculating an association information matrix between the features on the feature map; the step S2 specifically includes the following steps:
S21: through self-conversion, relationship modeling is performed on the current feature point q_i using the other feature points k_j on the feature map; the calculation formula is as follows:
where w_{i,j} denotes the spatial attention map, F_mos(·) the normalization function, q_{i,n} the query, k_{j,n} the key, ⊙ element-wise multiplication, X̃ the self-converted feature map, and v_j the values;
s22: by means of top-down feature conversion, high-dimensional semantic information is utilized Contextual information for low-dimensional featuresModeling is carried out, and the calculation formula is as follows:
wi,j=Fmos(Feud(qi,n,kj,n))
Wherein F eud (·) represents the Euclidean distance between two pixel points on the feature map;
S23: through feature conversion from bottom to top, relevant information modeling is carried out among feature map channels with different scales, and a specific calculation formula is as follows:
w=GAP(K)
Qatt=Fatt(Q,w)
Vdow=Fsconv(V)
Where w represents the channel attention map, GAP represents the global average pooling, K represents the keys of the network shallow feature map, Q att represents the channel attention weighted features, F att (-) represents the outer product function, Q represents the index of the network deep feature map, V dow represents the downsampled feature map, F sconv (-) is a 3X 3 convolution with step size, V represents the values of the network shallow feature map, Representing the feature map after bottom-up conversion, F conv (·) is a3×3 convolution for refinement, and F add (·) represents that the two feature maps are subjected to element-by-element addition and then subjected to 3×3 convolution again;
S3: constructing a multi-scale feature fusion module; in the step S3, three kinds of feature conversion in the step S2 are respectively performed on the middle 4-layer features of the coding network to obtain a plurality of enhanced advanced features, then the enhanced features are subjected to feature rearrangement according to the scale, the features with the same size are cascaded with the original features on the coding network, and finally the channel dimension of the enhanced features is restored to the same dimension in input through a 3×3 convolution;
S4: and a hole space pyramid pooling module is introduced into the decoding network, so that the receptive field of convolution is enlarged, and the network learns more local detail information.
2. The monocular depth estimation method based on multi-scale feature fusion of claim 1, wherein: the step S4 specifically includes the following steps:
S41: a plurality of cavity space pyramid pooling modules are embedded between feature graphs with different resolutions in a crossing way, and feature pyramid information with a large receptive field is captured;
S42: selecting a deconvolution module with a parameter capable of being learned based on an upsampling method; for all up-sampling modules, the output of the previous cavity space pyramid pooling module is deconvoluted to make the size of the feature map twice as large as the original one, and then the corresponding features output by the multi-scale feature fusion module and the depth map of the previous scale coarse estimation are cascaded;
S43: and for all the hole space pyramid pooling modules, respectively executing hole convolution with different hole rates on the input features, and outputting the input features as cascade connection of different hole convolution output features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111232322.5A CN113870335B (en) | 2021-10-22 | 2021-10-22 | Monocular depth estimation method based on multi-scale feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111232322.5A CN113870335B (en) | 2021-10-22 | 2021-10-22 | Monocular depth estimation method based on multi-scale feature fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113870335A CN113870335A (en) | 2021-12-31 |
CN113870335B true CN113870335B (en) | 2024-07-30 |
Family
ID=78997259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111232322.5A (CN113870335B, Active) | Monocular depth estimation method based on multi-scale feature fusion | 2021-10-22 | 2021-10-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113870335B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114565655B (en) * | 2022-02-28 | 2024-02-02 | 上海应用技术大学 | Depth estimation method and device based on pyramid segmentation attention |
CN114693823B (en) * | 2022-03-09 | 2024-06-04 | 天津大学 | Magnetic resonance image reconstruction method based on space-frequency double-domain parallel reconstruction |
CN115359271B (en) * | 2022-08-15 | 2023-04-18 | 中国科学院国家空间科学中心 | Large-scale invariance deep space small celestial body image matching method |
CN115115686B (en) * | 2022-08-22 | 2023-07-18 | 中国矿业大学 | Mine image unsupervised monocular depth estimation method based on fine granularity multi-feature fusion |
CN115580564B (en) * | 2022-11-09 | 2023-04-18 | 深圳桥通物联科技有限公司 | Dynamic calling device for communication gateway of Internet of things |
CN116342675B (en) * | 2023-05-29 | 2023-08-11 | 南昌航空大学 | Real-time monocular depth estimation method, system, electronic equipment and storage medium |
CN116823908B (en) * | 2023-06-26 | 2024-09-03 | 北京邮电大学 | Monocular image depth estimation method based on multi-scale feature correlation enhancement |
CN117078236B (en) * | 2023-10-18 | 2024-02-02 | 广东工业大学 | Intelligent maintenance method and device for complex equipment, electronic equipment and storage medium |
CN118212637B (en) * | 2024-05-17 | 2024-09-03 | 山东浪潮科学研究院有限公司 | Automatic image quality assessment method and system for character recognition |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2747028B1 (en) * | 2012-12-18 | 2015-08-19 | Universitat Pompeu Fabra | Method for recovering a relative depth map from a single image or a sequence of still images |
CN108510535B (en) * | 2018-03-14 | 2020-04-24 | 大连理工大学 | High-quality depth estimation method based on depth prediction and enhancer network |
CN112785636B (en) * | 2021-02-18 | 2023-04-28 | 上海理工大学 | Multi-scale enhanced monocular depth estimation method |
Non-Patent Citations (1)
Title |
---|
- Research on monocular depth estimation based on multi-scale feature fusion (基于多尺度特征融合的单目深度估计研究); Deng Chaolong (邓朝龙); Wanfang Data (万方数据); 2023-07-06; full text *
Also Published As
Publication number | Publication date |
---|---|
CN113870335A (en) | 2021-12-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |