CN111325782A - Unsupervised monocular view depth estimation method based on multi-scale unification - Google Patents
Unsupervised monocular view depth estimation method based on multi-scale unification
- Publication number: CN111325782A (application CN202010099283.5A)
- Authority: CN (China)
- Prior art keywords: image, input, loss, network, original
- Prior art date: 2020-02-18
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/50—Depth or shape recovery (under G06T7/00—Image analysis)
- G06N3/045—Combinations of networks (under G06N3/04—Architecture, e.g. interconnection topology)
- G06N3/08—Learning methods (under G06N3/02—Neural networks)
- G06T2207/10012—Stereo images (under G06T2207/10—Image acquisition modality)
- G06T2207/20016—Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
- G06T2207/20228—Disparity calculation for image-based rendering
Abstract
The invention belongs to the technical field of image processing and discloses an unsupervised monocular view depth estimation method based on multi-scale unification, which comprises the following steps: S1: performing pyramid multi-scale processing on the input stereo image pair; S2: constructing an encoding-decoding network framework; S3: transmitting the features extracted in the encoding stage to a deconvolution neural network to realize feature extraction for input images of different scales; S4: uniformly up-sampling the disparity maps of different scales to the original input size; S5: reconstructing images using the original input images and the corresponding disparity maps; S6: constraining the accuracy of the image reconstruction; S7: training the network model by a gradient descent method; S8: fitting the corresponding disparity map from the input image and the pre-trained model. The design of the invention does not need real depth data to supervise network training, and easily obtained binocular images are used as training samples, which greatly reduces the difficulty of acquiring training data and solves the problem of depth-map holes caused by the blurring of low-scale disparity maps.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to an unsupervised monocular view depth estimation method based on multi-scale unification.
Background
With the development of science and technology and the explosive growth of information, people's attention to image scenes is gradually shifting from two dimensions to three dimensions, and three-dimensional information about objects brings great convenience to daily life; it is most widely applied in driver-assistance systems for driving scenes. Because images contain such rich information, visual sensors cover almost all the relevant information required for driving, including but not limited to lane geometry, traffic signs, lights, and object positions and speeds. Among all forms of visual information, depth information plays a very important role in a driver-assistance system. For example, a collision avoidance system issues collision warnings by calculating the depth between an obstacle and the vehicle; when the distance between a pedestrian and the vehicle becomes too small, the pedestrian protection system automatically takes measures to decelerate the vehicle. Therefore, only by acquiring the depth information between the current vehicle and the other traffic participants in the driving scene can the driver-assistance system accurately relate to the external environment, so that its early-warning subsystems can work normally.
Many sensors currently on the market can obtain depth information, for example lidar. Lidar can generate sparse three-dimensional point-cloud data, but it is costly and its usable scenes are limited, so people have turned to recovering the three-dimensional structural information of a scene from images.
Traditional image-based depth estimation methods mostly rely on geometric constraints assumed about the shooting environment and on hand-crafted features; a widely applied example is recovering structure from motion.
As convolutional neural networks have excelled at other visual tasks, many researchers have begun exploring deep learning methods for monocular image depth estimation. Various models have been designed to fully exploit the strong learning capacity of neural networks and mine the relationship between an original image and its depth map, so that the network is trained to predict scene depth from an input image. However, as mentioned above, the true depth of a scene is very difficult to obtain, which means that one has to move away from ground-truth depth labels and complete the depth estimation task with unsupervised methods. One class of unsupervised methods uses the temporal information of monocular video as the supervision signal; however, because the video is acquired while the camera itself is moving and the relative pose of the camera between image frames is unknown, such methods need to train an additional pose estimation network on top of the depth estimation network, which undoubtedly increases the difficulty of an already complex depth estimation task. In addition, owing to the scale ambiguity of monocular video, such methods can only obtain relative depth, i.e. the relative distances between pixels in the image, not the actual distance from an object in the image to the camera. Moreover, existing unsupervised depth estimation methods suffer from missing texture or even holes in the depth map caused by blurred details in low-scale feature maps, which directly affects the accuracy of depth estimation.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an unsupervised monocular view depth estimation method based on multi-scale unification.
In order to achieve the purpose, the invention adopts the following technical scheme:
an unsupervised monocular view depth estimation method based on multi-scale unification comprises the following steps:
step S1: carrying out pyramid multi-scale processing on the input stereo image pair so as to extract features of multiple scales;
step S2: constructing a network framework of coding and decoding to obtain a disparity map which can be used for obtaining a depth map;
step S3: the features extracted in the encoding stage are transmitted to a deconvolution neural network to realize feature extraction for the input images of different scales, and the disparity maps of the input images of different scales are fitted in the decoding stage;
step S4: uniformly up-sampling disparity maps of different scales to an original input size;
step S5: reconstructing images by using the original images of the input stereo pair and the corresponding disparity maps;
step S6: the accuracy of image reconstruction is constrained through appearance matching loss, left-right parallax conversion loss and parallax smoothing loss;
step S7: training the network model by a gradient descent method based on the idea of minimizing the loss;
step S8: in the testing stage, fitting a corresponding disparity map according to an input image and a pre-training model; and calculating a corresponding scene depth map by using a binocular imaging triangulation principle and the disparity map.
Preferably, in step S1, the input image is down-sampled to four sizes 1, 1/2, 1/4, 1/8 of the original image to form a pyramid input structure, and then the pyramid input structure is sent to the coding model for feature extraction.
Preferably, in step S2, a ResNet-101 network structure is used as a network model in the encoding stage, and the ResNet network structure adopts residual design, so that information loss is reduced while the network deepens.
Preferably, in step S3, in the encoding stage, feature extraction is performed on the input images of different scales, and the extracted features are transmitted to the deconvolution neural network in the decoding stage to implement disparity map fitting, specifically:
step S41: feature extraction is respectively performed on the pyramid-structured input images through the ResNet-101 network in the encoding stage; each input image, whatever its size, is reduced by a factor of 16 during extraction, yielding features at 1/16, 1/32, 1/64 and 1/128 of the original input size;
step S42: inputting the features of four sizes obtained in the encoding stage into a network in the decoding stage, deconvoluting the input features layer by layer in the process to restore the input features to a pyramid structure of the original input image 1, 1/2, 1/4 and 1/8 sizes, and respectively fitting disparity maps of the images of 4 sizes according to the input features and the deconvolution network;
preferably, in the step S4, the disparity maps with the sizes of 1, 1/2, 1/4 and 1/8 of the original input image are collectively up-sampled to the size of the original input image.
Preferably, in the step S5, since the disparity maps of the 4 sizes are uniformly up-sampled to the original input size, the originally input left image I_l and the right disparity map d_r are used to reconstruct a right image Ĩ_r, and the original right image I_r and the left disparity map d_l are used to reconstruct a left image Ĩ_l.
Preferably, in the step S6, the accuracy of image reconstruction is constrained by calculating the loss using the original input left and right views and the reconstructed left and right views;
minimizing a loss function by adopting a gradient descent method, and training an image reconstruction network by adopting the method, specifically:
step S71: the loss function is composed of three parts, namely the appearance matching loss C_a, the smoothing loss C_s and the parallax conversion loss C_t; for each loss term, the left and right images are computed in the same way, and the final loss function is composed of a weighted sum of the three terms: C = α_a·C_a + α_s·C_s + α_t·C_t;
step S72: losses are respectively calculated, at the original input size, between the different disparity maps and the original input image, giving 4 losses C_i, i = 1, 2, 3, 4; the total loss function is C_total = C_1 + C_2 + C_3 + C_4.
Preferably, in step S7, the network model is trained by using a gradient descent method using the concept of minimizing loss.
Preferably, in the step S8, in the test stage, the input single image and the pre-training model are used to fit the disparity map corresponding to the input image, and according to the principle of triangulation of binocular imaging, the disparity map is used to generate a corresponding depth image, specifically:
D(i, j) = b·f / d(i, j), where (i, j) is the pixel-level coordinate of any point in the image, D(i, j) is the depth value of the point, d(i, j) is the parallax value of the point, b is the known distance between the two cameras, and f is the known focal length of the camera.
According to the unsupervised monocular view depth estimation method based on multi-scale unification, when a common deep learning method is used to solve the depth estimation problem, a real depth image corresponding to the input image is required; however, real depth data are expensive to acquire, only sparse point-cloud depth can be obtained, and the application requirements cannot be fully met. Under this condition, the training process of the model is supervised by an image reconstruction loss, and binocular images, which are relatively easy to acquire, are used for training instead of real depth, thereby realizing unsupervised depth estimation;
in the unsupervised monocular view depth estimation method based on multi-scale unification provided by the invention, pyramid multi-scale processing is performed on the input stereo image pair in the encoding stage, which reduces the influence of targets of different sizes on the depth estimation;
according to the unsupervised monocular view depth estimation method based on multi-scale unification, in view of the blurring of low-scale depth maps, all disparity maps are uniformly up-sampled to the original input size, and image reconstruction and loss calculation are carried out at that size, which solves the problem of depth-map holes;
the method is reasonable in design; real depth data are not needed to supervise network training, and easily obtained binocular images are used as training samples, which greatly reduces the difficulty of acquiring training data while solving the problem of depth-map holes caused by the blurring of low-scale disparity maps.
Drawings
FIG. 1 is a flowchart of an unsupervised monocular view depth estimation method based on multi-scale unification according to the present invention;
FIG. 2 is a network model structure diagram of an unsupervised monocular view depth estimation method based on multi-scale unification according to the present invention;
FIG. 3 is a schematic diagram of a bottleneck module of a network structure of an unsupervised monocular view depth estimation method based on multi-scale unification according to the present invention;
FIG. 4 is a unified scale diagram of an unsupervised monocular view depth estimation method based on multi-scale unification according to the present invention;
fig. 5 is an estimation result graph of the unsupervised monocular view depth estimation method based on multi-scale unification on the classic driving data set KITTI, (a) is an input image, and (b) is a depth estimation result graph;
fig. 6 is a diagram of the generalization results of the unsupervised monocular view depth estimation method based on multi-scale unification on road-scene pictures taken in real time, where (a) is an input image and (b) is a depth estimation result diagram.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Referring to fig. 1-6, in the unsupervised monocular depth estimation method based on multi-scale unification, the unsupervised monocular depth estimation network model is trained on a laboratory desktop workstation: the graphics card is an NVIDIA GeForce GTX 1080Ti, the training system is Ubuntu 14.04, and TensorFlow 1.4.0 is adopted as the framework platform; training is performed on the classic driving data set, the KITTI 2015 stereo data set.
As shown in fig. 1, the unsupervised monocular view depth estimation method based on multi-scale unification of the present invention specifically includes the following steps:
step S1: the binocular data set of the classic driving data set KITTI is adopted as the training set; the scale parameter is set to 4, the images are down-sampled to 1/2, 1/4 and 1/8 of the input size and, together with the original images, form a four-size pyramid structure, which is then sent to the ResNet-101 neural network model for feature extraction (a sketch of this pyramid construction is given below);
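By way of illustration only, the pyramid construction in step S1 can be sketched as follows; the resize routine, interpolation mode and data layout are assumptions made for the sketch and are not prescribed by the method:

```python
import cv2  # assumed here only for resizing; any equivalent resize routine would do

def build_pyramid(image, num_scales=4):
    """Down-sample an input image to 1, 1/2, 1/4 and 1/8 of its size (scale parameter 4)."""
    h, w = image.shape[:2]
    pyramid = []
    for s in range(num_scales):
        factor = 2 ** s                                   # 1, 2, 4, 8
        pyramid.append(cv2.resize(image, (w // factor, h // factor),
                                  interpolation=cv2.INTER_AREA))
    return pyramid  # [full, 1/2, 1/4, 1/8], fed together to the ResNet-101 encoder

# usage: left_pyramid = build_pyramid(left_image); right_pyramid = build_pyramid(right_image)
```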
step S2: constructing a network framework of coding and decoding to obtain a disparity map which can be used for obtaining a depth map; the specific process is as follows:
the residual structure in the ResNet network is shown in fig. 3(a): a 1 × 1 convolution is first used to reduce the feature dimensionality, a 3 × 3 convolution follows, and a final 1 × 1 convolution restores the dimensionality, so that the parameter quantity is as follows:
1×1×256×64+3×3×64×64+1×1×64×256=69632
while a normal ResNet module is shown in fig. 3(b), the parameters are:
3×3×256×256×2=1179648
therefore, the parameter quantity can be greatly reduced by using the residual error module with the bottleneck structure;
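The parameter comparison above can be reproduced with a short calculation; the snippet below only counts convolution weights (biases and batch-normalization parameters are ignored) and is an illustration rather than part of the claimed method:

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution from c_in to c_out channels (no bias)."""
    return kernel * kernel * c_in * c_out

# bottleneck residual module: 1x1 reduce -> 3x3 convolution -> 1x1 restore
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 64) + conv_params(1, 64, 256)
# plain residual module: two 3x3 convolutions at 256 channels
plain = 2 * conv_params(3, 256, 256)

print(bottleneck, plain)  # 69632 1179648, i.e. roughly 17x fewer parameters with the bottleneck
```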
step S3: the features extracted in the encoding stage are transmitted to a deconvolution neural network to realize feature extraction for the input images of different scales, and the disparity maps of the input images of different scales are fitted in the decoding stage, which specifically comprises the following steps:
step S31: in the network decoding process, in order to ensure that the sizes of the feature maps in the deconvolution neural network correspond to the sizes of the ResNet-101 residual network feature maps, the network directly connects part of the feature maps from the ResNet-101 encoding process to the deconvolution neural network through skip connections;
step S32: feature extraction is respectively performed on the pyramid-structured input images through the ResNet-101 network in the encoding stage; each input image, whatever its size, is reduced by a factor of 16 during extraction, yielding features at 1/16, 1/32, 1/64 and 1/128 of the original input size;
step S33: inputting the features of four sizes obtained in the encoding stage into a network in the decoding stage, deconvoluting the input features layer by layer in the process to restore the input features to a pyramid structure of the original input image 1, 1/2, 1/4 and 1/8 sizes, and respectively fitting approximate disparity maps of the images of 4 sizes according to the input features and the deconvolution network;
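One decoding stage of the kind described in steps S31-S33 (deconvolution, fusion with the skip-connected encoder feature, then a disparity prediction head) might be sketched as below, in the TensorFlow 1.x layers style used elsewhere in this description; the channel widths, ELU activation and sigmoid output scaling are assumptions made only for the sketch:

```python
import tensorflow as tf  # TensorFlow 1.x layers API assumed

def decoder_stage(features, skip_features, channels):
    """One decoding step: 2x deconvolution, fuse the encoder skip connection, predict disparity."""
    up = tf.layers.conv2d_transpose(features, channels, kernel_size=3, strides=2,
                                    padding='same', activation=tf.nn.elu)
    fused = tf.concat([up, skip_features], axis=-1)       # skip connection from the ResNet-101 encoder
    fused = tf.layers.conv2d(fused, channels, 3, padding='same', activation=tf.nn.elu)
    # left/right disparity channels; the 0.3 output scale is an assumption for the sketch
    disp = 0.3 * tf.layers.conv2d(fused, 2, 3, padding='same', activation=tf.nn.sigmoid)
    return fused, disp
```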
step S4: uniformly up-sampling disparity maps with the sizes of 1, 1/2, 1/4 and 1/8 of the original input image to the size of the original input image;
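Step S4 thus brings every scale's disparity map to the full input resolution before any reconstruction or loss is computed; a minimal sketch (bilinear resizing assumed) is:

```python
import tensorflow as tf  # TensorFlow 1.x assumed

def upsample_to_input_size(disparity_maps, height, width):
    """Up-sample the disparity maps of all 4 scales to the original input size (step S4)."""
    return [tf.image.resize_images(d, [height, width]) for d in disparity_maps]
```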
step S5: an image is reconstructed using the input original images and the corresponding disparity maps: a right view is reconstructed by using the right disparity map and the left view corresponding to it, a left image is reconstructed by using the original right image and the left disparity map, and finally the reconstructed left and right images are compared with the input original left and right images respectively;
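The reconstruction itself amounts to sampling one view at positions shifted horizontally by the disparity. The sketch below is a deliberately simplified NumPy version (nearest-neighbour sampling instead of the differentiable bilinear sampler a trainable network would need, and the sign of the shift depends on the chosen disparity convention); it only illustrates the principle:

```python
import numpy as np

def reconstruct_right(left_image, right_disparity):
    """Rebuild the right view by sampling the left image at columns shifted by d_r (in pixels)."""
    h, w = right_disparity.shape
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    cols = np.arange(w)[None, :].repeat(h, axis=0)
    src_cols = np.clip(np.rint(cols + right_disparity).astype(int), 0, w - 1)  # sign: convention-dependent
    return left_image[rows, src_cols]   # an approximation of I_r built from I_l and d_r
```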
step S6: then, the accuracy of image synthesis is constrained by using appearance matching loss, left-right parallax conversion loss and parallax smoothing loss; the method specifically comprises the following steps:
step S61: the loss function is composed of three parts, namely the appearance matching loss C_a, the smoothing loss C_s and the parallax conversion loss C_t;
in the image reconstruction process, the appearance matching loss C_a is first used to measure the pixel-by-pixel accuracy between the reconstructed image and the corresponding input image; this loss is jointly composed of a structural similarity term and an L1 loss term. Taking the input left image as an example:
here S is the structural similarity index, which consists of a brightness measurement, a contrast measurement and a structural comparison and is used to measure the similarity between two images; the more similar the two images are, the higher the similarity index value. The L1 loss is the least-absolute-error loss, which compares the difference between the two images pixel by pixel and, relative to the L2 distance, is less sensitive to outliers. α is the weight coefficient of the structural similarity term in the appearance matching loss, and N is the total number of pixels in the image;
second, the smoothing loss C_s alleviates discontinuities in the disparity map caused by excessively large local gradients and ensures the smoothness of the produced disparity map; taking the left map as an example, the specific formula is as follows:
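One frequently used edge-aware form of such a smoothness term (disparity gradients down-weighted where the image itself has strong gradients) is sketched below purely as an assumption consistent with this description; images are taken to be normalized to [0, 1]:

```python
import numpy as np

def disparity_smoothness_loss(disp, image):
    """C_s sketch: penalise disparity gradients, weighted by the image gradients (edge-aware)."""
    ddx = np.abs(np.diff(disp, axis=1))                        # horizontal disparity gradients
    ddy = np.abs(np.diff(disp, axis=0))                        # vertical disparity gradients
    idx = np.mean(np.abs(np.diff(image, axis=1)), axis=-1)     # image gradients, averaged over channels
    idy = np.mean(np.abs(np.diff(image, axis=0)), axis=-1)
    return float(np.mean(ddx * np.exp(-idx)) + np.mean(ddy * np.exp(-idy)))
```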
the parallax conversion loss C_t aims to reduce the conversion error between the right disparity map generated from the left image and the left disparity map generated from the right image, ensuring consistency between the two disparity maps; the specific formula is as follows:
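One plausible realisation of this consistency check, again simplified to nearest-neighbour sampling and with the sign of the shift depending on the disparity convention, is:

```python
import numpy as np

def lr_consistency_loss(disp_left, disp_right):
    """C_t sketch: compare the left disparity with the right disparity sampled at the shifted positions."""
    h, w = disp_left.shape
    rows = np.arange(h)[:, None].repeat(w, axis=1)
    cols = np.arange(w)[None, :].repeat(h, axis=0)
    src = np.clip(np.rint(cols - disp_left).astype(int), 0, w - 1)   # sign: convention-dependent
    return float(np.mean(np.abs(disp_left - disp_right[rows, src])))
```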
for each loss term, the left and right images are computed in the same way, and the final loss function is composed of a weighted sum of the three terms: C = α_a·C_a + α_s·C_s + α_t·C_t,
where α_a is the weight of the appearance matching loss in the total loss, α_s is the weight of the smoothing loss in the total loss, and α_t is the weight of the conversion loss in the total loss;
step S62: losses are respectively calculated, at the original input size, between the different disparity maps and the original input image, giving 4 losses C_i, i = 1, 2, 3, 4; the total loss function is C_total = C_1 + C_2 + C_3 + C_4;
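Combining the three weighted terms at each scale and summing over the four scales (all already up-sampled to the original input size) can then be sketched as follows; the weight values shown are placeholders, since the actual settings are listed in Table 1:

```python
def total_loss(per_scale_terms, alpha_a=1.0, alpha_s=0.1, alpha_t=1.0):
    """C_total = C_1 + ... + C_4, with C_i = alpha_a*C_a + alpha_s*C_s + alpha_t*C_t per scale.

    per_scale_terms: list of 4 tuples (C_a, C_s, C_t), one per scale."""
    return sum(alpha_a * ca + alpha_s * cs + alpha_t * ct
               for ca, cs, ct in per_scale_terms)
```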
Step S7: the network model is trained by a gradient descent method under the idea of minimizing the loss, specifically: for training on stereo image pairs, the depth estimation model is built on the open-source TensorFlow 1.4.0 platform, and the KITTI data set with stereo image pairs is used as the training set, of which 29000 pairs are used to train the model. During training an initial learning rate lr is set; after 40 epochs the learning rate is halved every 10 epochs, and a total of 70 epochs are trained. Meanwhile, the batch size is set to bs, i.e. bs pictures are processed at a time. An Adam optimizer is used to optimize the model, with β1 and β2 set to control the decay rates of the moving averages of the weight coefficients. All training is completed within 34 hours on a GTX 1080Ti experimental platform;
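The training schedule described above (initial rate lr kept for 40 epochs, then halved every 10 epochs over 70 epochs in total, with an Adam optimizer whose β1 and β2 control the moving-average decay rates) might be expressed as below; the numeric values are placeholders, as the actual ones are given in Table 1:

```python
import tensorflow as tf  # TensorFlow 1.x assumed

def learning_rate_at(epoch, lr0=1e-4):
    """Keep lr0 for the first 40 epochs, then halve it every 10 epochs (70 epochs in total)."""
    if epoch < 40:
        return lr0
    return lr0 * (0.5 ** ((epoch - 40) // 10 + 1))

learning_rate = tf.placeholder(tf.float32, shape=[])                       # fed per training step
optimizer = tf.train.AdamOptimizer(learning_rate, beta1=0.9, beta2=0.999)  # beta values: placeholders
```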
TABLE 1 loss function and training parameters
Step S8: in the testing stage, a corresponding disparity map is fitted from the input image and the pre-trained model, and the corresponding scene depth map is calculated from the disparity map by using the binocular imaging triangulation principle. In the KITTI road driving data set adopted in the experiment, the baseline distance of the cameras is fixed at 0.54 m; the focal length varies with the camera model, and different camera models correspond to different image sizes in the KITTI data set, the correspondence being as follows:
the conversion formula between depth and parallax is specifically D(i, j) = b·f / d(i, j),
wherein (i, j) is the pixel coordinate of any point in the image, D(i, j) is the depth value of the point, and d(i, j) is the parallax value of the point; b and f are the baseline distance and focal length given above;
therefore, a disparity map corresponding to the input image is fitted from the input image and the network model pre-trained on the binocular image reconstruction principle, and the scene depth map corresponding to the image captured by the camera can then be calculated from the known camera focal length and baseline distance.
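For completeness, the conversion from a fitted disparity map to a metric depth map with the fixed 0.54 m KITTI baseline is a one-line computation; the focal length passed in would be chosen according to the image-size / camera-model correspondence mentioned above, and both disparity and focal length are assumed to be expressed in pixels:

```python
import numpy as np

def disparity_to_depth(disp, focal_length, baseline=0.54):
    """D(i, j) = b * f / d(i, j); baseline in metres, disparity and focal length in pixels."""
    return baseline * focal_length / np.maximum(disp, 1e-6)   # guard against zero disparity
```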
The standard parts used in the invention can be purchased from the market, the special-shaped parts can be customized according to the description of the specification and the accompanying drawings, the specific connection mode of each part adopts conventional means such as bolts, rivets, welding and the like mature in the prior art, the machines, the parts and equipment adopt conventional models in the prior art, and the circuit connection adopts the conventional connection mode in the prior art, so that the detailed description is omitted.
Claims (9)
1. An unsupervised monocular view depth estimation method based on multi-scale unification is characterized by comprising the following steps:
step S1: carrying out pyramid multi-scale processing on the input stereo image pair so as to extract features of multiple scales;
step S2: constructing a network framework of coding and decoding to obtain a disparity map which can be used for calculating a depth map;
step S3: the features extracted in the encoding stage are transmitted to a deconvolution neural network to realize feature extraction for the input images of different scales, and the disparity maps of the input images of different scales are fitted in the decoding stage;
step S4: uniformly up-sampling disparity maps of different scales to an original input size;
step S5: reconstructing an image by using the input original image and a corresponding disparity map;
step S6: the accuracy of image reconstruction is constrained through appearance matching loss, left-right parallax conversion loss and parallax smoothing loss;
step S7: training a network model by using a gradient descent method by using a loss minimization idea;
step S8: in the testing stage, fitting a corresponding disparity map according to an input image and a pre-training model; and calculating a corresponding scene depth map by using a binocular imaging triangulation principle and the disparity map.
2. The method as claimed in claim 1, wherein in step S1, the input image is down-sampled to four sizes of 1, 1/2, 1/4, 1/8 of the original image to form a pyramid input structure, and then sent to the coding model for feature extraction.
3. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S2, a ResNet-101 network structure is adopted as a network model in an encoding stage.
4. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S3, feature extraction is performed on input images of different scales in an encoding stage, and the extracted features are transmitted to a deconvolution neural network in a decoding stage to implement disparity map fitting, specifically:
step S41: respectively performing feature extraction on the input image with the pyramid structure through a ResNet-101 network in an encoding stage, and reducing the input image to 1/16 in the extraction process relative to input images with different sizes to obtain features of original input images 1/16, 1/32, 1/64 and 1/128;
step S42: inputting the features of four sizes obtained in the encoding stage into the network in the decoding stage, deconvoluting the input features layer by layer in the process to restore the input features to the pyramid structures of the original input images 1, 1/2, 1/4 and 1/8 sizes, and respectively fitting the disparity maps of the images of 4 sizes according to the input features and the deconvolution network.
5. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in the step S4, the disparity maps with the size of 1, 1/2, 1/4, 1/8 of the original input image are unified up-sampled to the size of the original input image.
6. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, characterized in that in the step S5, since the disparity maps of the 4 sizes are uniformly up-sampled to the original input size, the originally input left image I_l and the right disparity map d_r are used to reconstruct a right image Ĩ_r, and the original right image I_r and the left disparity map d_l are used to reconstruct a left image Ĩ_l.
7. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S6, accuracy of image reconstruction is constrained by calculating loss using the original input left and right views and the reconstructed left and right views;
minimizing a loss function by adopting a gradient descent method, and training an image reconstruction network by adopting the method, specifically:
step S71: the loss function is composed of three parts, namely the appearance matching loss C_a, the smoothing loss C_s and the parallax conversion loss C_t; for each loss term, the left and right images are computed in the same way, and the final loss function is composed of a weighted sum of the three terms: C = α_a·C_a + α_s·C_s + α_t·C_t.
8. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in step S7, a network model is trained by using a gradient descent method using an idea of minimizing loss.
9. The unsupervised monocular view depth estimation method based on multi-scale unification as claimed in claim 1, wherein in the step S8, in the testing stage, an input single image and a pre-trained model are used to fit a disparity map corresponding to the input image, and according to a principle of triangulation of binocular imaging, the disparity map is used to generate a corresponding depth image, specifically:
D(i, j) = b·f / d(i, j), where (i, j) is the pixel-level coordinate of any point in the image, D(i, j) is the depth value of the point, d(i, j) is the parallax value of the point, b is the known distance between the two cameras, and f is the known focal length of the camera.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010099283.5A CN111325782A (en) | 2020-02-18 | 2020-02-18 | Unsupervised monocular view depth estimation method based on multi-scale unification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010099283.5A CN111325782A (en) | 2020-02-18 | 2020-02-18 | Unsupervised monocular view depth estimation method based on multi-scale unification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111325782A true CN111325782A (en) | 2020-06-23 |
Family
ID=71172765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010099283.5A Pending CN111325782A (en) | 2020-02-18 | 2020-02-18 | Unsupervised monocular view depth estimation method based on multi-scale unification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111325782A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110163246A (en) * | 2019-04-08 | 2019-08-23 | 杭州电子科技大学 | The unsupervised depth estimation method of monocular light field image based on convolutional neural networks |
CN110490919A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of depth estimation method of the monocular vision based on deep neural network |
CN110443843A (en) * | 2019-07-29 | 2019-11-12 | 东北大学 | A kind of unsupervised monocular depth estimation method based on generation confrontation network |
Non-Patent Citations (1)
Title |
---|
王欣盛 等 (et al.): "Monocular Depth Estimation Based on Convolutional Neural Networks" (《基于卷积神经网络的单目深度估计》) *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111915660A (en) * | 2020-06-28 | 2020-11-10 | 华南理工大学 | Binocular disparity matching method and system based on shared features and attention up-sampling |
CN111915660B (en) * | 2020-06-28 | 2023-01-06 | 华南理工大学 | Binocular disparity matching method and system based on shared features and attention up-sampling |
CN112396645A (en) * | 2020-11-06 | 2021-02-23 | 华中科技大学 | Monocular image depth estimation method and system based on convolution residual learning |
CN112396645B (en) * | 2020-11-06 | 2022-05-31 | 华中科技大学 | Monocular image depth estimation method and system based on convolution residual learning |
CN112700532A (en) * | 2020-12-21 | 2021-04-23 | 杭州反重力智能科技有限公司 | Neural network training method and system for three-dimensional reconstruction |
CN112700532B (en) * | 2020-12-21 | 2021-11-16 | 杭州反重力智能科技有限公司 | Neural network training method and system for three-dimensional reconstruction |
CN113139999A (en) * | 2021-05-14 | 2021-07-20 | 广东工业大学 | Transparent object single-view multi-scale depth estimation method and system |
CN113313732A (en) * | 2021-06-25 | 2021-08-27 | 南京航空航天大学 | Forward-looking scene depth estimation method based on self-supervision learning |
CN114283089A (en) * | 2021-12-24 | 2022-04-05 | 北京的卢深视科技有限公司 | Jump acceleration based depth recovery method, electronic device, and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111325782A (en) | Unsupervised monocular view depth estimation method based on multi-scale unification | |
CN109685842B (en) | Sparse depth densification method based on multi-scale network | |
US20210142095A1 (en) | Image disparity estimation | |
CN113936139B (en) | Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation | |
CN109741383A (en) | Picture depth estimating system and method based on empty convolution sum semi-supervised learning | |
CN108062769B (en) | Rapid depth recovery method for three-dimensional reconstruction | |
CN108961327A (en) | A kind of monocular depth estimation method and its device, equipment and storage medium | |
CN110689562A (en) | Trajectory loop detection optimization method based on generation of countermeasure network | |
AU2021103300A4 (en) | Unsupervised Monocular Depth Estimation Method Based On Multi- Scale Unification | |
CN110517306B (en) | Binocular depth vision estimation method and system based on deep learning | |
CN113313732A (en) | Forward-looking scene depth estimation method based on self-supervision learning | |
CN110009675B (en) | Method, apparatus, medium, and device for generating disparity map | |
EP3953903A1 (en) | Scale-aware monocular localization and mapping | |
CN111027415A (en) | Vehicle detection method based on polarization image | |
CN114119889B (en) | Cross-modal fusion-based 360-degree environmental depth completion and map reconstruction method | |
CN110942484A (en) | Camera self-motion estimation method based on occlusion perception and feature pyramid matching | |
CN117593702B (en) | Remote monitoring method, device, equipment and storage medium | |
CN117197388A (en) | Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography | |
CN117058474B (en) | Depth estimation method and system based on multi-sensor fusion | |
CN116342675B (en) | Real-time monocular depth estimation method, system, electronic equipment and storage medium | |
CN117437274A (en) | Monocular image depth estimation method and system | |
CN115187959B (en) | Method and system for landing flying vehicle in mountainous region based on binocular vision | |
Guan et al. | Improved RefineDNet algorithm for precise environmental perception of autonomous earthmoving machinery under haze and fugitive dust conditions | |
CN112927139B (en) | Binocular thermal imaging system and super-resolution image acquisition method | |
CN113706599B (en) | Binocular depth estimation method based on pseudo label fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200623 |
RJ01 | Rejection of invention patent application after publication | |