
CN110766152B - Method and apparatus for training deep neural networks - Google Patents

Method and apparatus for training deep neural networks Download PDF

Info

Publication number
CN110766152B
CN110766152B (grant) · CN201810844262.4A (application)
Authority
CN
China
Prior art keywords
depth map
training sample
planar region
region
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810844262.4A
Other languages
Chinese (zh)
Other versions
CN110766152A (en)
Inventor
李斐
田虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201810844262.4A priority Critical patent/CN110766152B/en
Publication of CN110766152A publication Critical patent/CN110766152A/en
Application granted granted Critical
Publication of CN110766152B publication Critical patent/CN110766152B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/231Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method and apparatus for training a deep neural network. According to one embodiment of the present disclosure, the method comprises the steps of: for each training sample image in the training set, generating a corresponding estimated depth map from the training sample image using a deep neural network; calculating a loss for the training sample image based on the training sample depth map and the estimated depth map of the training sample image; and optimizing parameters of the deep neural network based on the calculated loss, wherein the loss includes a loss term calculated based on a comparison of at least one planar region in the training sample depth map and a corresponding region in the estimated depth map. The trained deep neural network obtained by the method and apparatus can improve the accuracy of depth map estimation when a single input image is used.

Description

Method and apparatus for training deep neural networks
Technical Field
The present disclosure relates generally to the field of three-dimensional image processing, and in particular, to a method and apparatus for training a deep neural network.
Background
In recent years, with the development of three-dimensional imaging technology, digitized three-dimensional objects have been widely used in many fields of daily life, such as augmented reality, digital museums, three-dimensional printing, and the like. An important aspect of three-dimensional imaging is three-dimensional reconstruction, for which depth information is critical. In general, depth may be estimated from a single image, from two images, or from more than two images. Estimating depth from a single image requires only one input image, and the estimated depth can conveniently be used in computer vision applications such as object recognition and pose estimation.
For depth map estimation based on a single image, the input is one image and the output is the corresponding depth map. In recent years, mainstream single-image depth map estimation methods have generally used a deep neural network to learn the relationship between visual information and depth information. To obtain more accurate depth estimation results, researchers have proposed many effective training methods, such as exploiting the gradient of the depth or introducing multi-scale image information. However, most existing methods focus only on the color and depth data themselves. Since the relationship between an image and its corresponding depth map is quite complex, it is difficult to learn a direct mapping model between the two. To further improve the performance of single-image depth estimation, additional information needs to be exploited. In other words, the accuracy of depth estimation from a single image still needs to be improved.
Disclosure of Invention
A brief summary of the disclosure will be presented below in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
According to one aspect of the present disclosure, there is provided a method for training a deep neural network, the method comprising the steps of: for each training sample image in the training set, generating a corresponding estimated depth map from the training sample image using the deep neural network; calculating a loss for the training sample image based on the training sample depth map and the estimated depth map of the training sample image; and optimizing parameters of the deep neural network based on the calculated loss, wherein the loss includes a loss term calculated based on a comparison of at least one planar region in the training sample depth map and a corresponding region in the estimated depth map.
According to another aspect of the present disclosure, there is provided a method for training a deep neural network, the method comprising the steps of: for each training sample image in the training set, generating a corresponding estimated depth map from the training sample image using the deep neural network; detecting at least one planar region from a training sample depth map of the training sample image; calculating a loss for the training sample image based on the training sample depth map and the estimated depth map of the training sample image; and optimizing parameters of the deep neural network based on the calculated loss; wherein the loss includes a loss term calculated based on a comparison of the at least one planar region in the training sample depth map and a corresponding region in the estimated depth map; wherein the at least one planar region is detected from the training sample depth map by: calculating gradient values of a plurality of pixels of the training sample depth map, clustering based on the gradient values and the positions of the plurality of pixels to obtain at least one cluster, and extracting a connected component of the image region corresponding to each cluster as one planar region of the at least one planar region; and wherein each of the at least one planar region satisfies: within the planar region, the average of the absolute values of the second-order gradient values is below a predetermined second-order gradient threshold.
According to another aspect of the present invention, there is provided an apparatus for training a deep neural network, comprising: a depth map estimation unit configured to generate a corresponding estimated depth map from a training sample image using the deep neural network; a loss calculation unit configured to calculate a loss for the training sample image based on the training sample depth map of the training sample image and the estimated depth map; and a parameter optimization unit configured to optimize parameters of the deep neural network based on the calculated loss; wherein the loss includes a loss term calculated based on a comparison of at least one planar region in the training sample depth map and a corresponding region in the estimated depth map.
The trained deep neural network obtained by the method and apparatus of the present disclosure can improve the accuracy of depth map estimation when a single input image is used.
Drawings
The present disclosure may be better understood by reference to the following description taken in conjunction with the accompanying drawings, which are incorporated in and form a part of this specification, along with the following detailed description. In the drawings:
FIG. 1 is an exemplary flowchart of a method for training a deep neural network, according to one embodiment of the present disclosure;
FIG. 2 is an exemplary flowchart of a method for training a deep neural network, according to another embodiment of the present disclosure;
FIG. 3 is an exemplary flowchart of a method for determining a depth map according to one embodiment of the present disclosure;
FIG. 4 is an exemplary block diagram of an apparatus for training a deep neural network, according to one embodiment of the present disclosure; and
fig. 5 is an exemplary block diagram of an information processing apparatus according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, and that these decisions may vary from one implementation to another.
It should be noted here that, in order to avoid obscuring the present disclosure due to unnecessary details, only the device structures closely related to the scheme according to the present disclosure are shown in the drawings, and other details not greatly related to the present disclosure are omitted.
It is to be understood that the present disclosure is not limited to the embodiments described below with reference to the drawings. Where possible, embodiments may be combined with one another, features may be replaced or borrowed between different embodiments, and one or more features may be omitted from an embodiment.
A method of the present invention for training a deep neural network is described below with reference to fig. 1.
Fig. 1 is an exemplary flowchart of a method 100 for training a deep neural network, according to one embodiment of the present disclosure.
When a single image is input and an estimated depth map is obtained using a deep neural network, the accuracy of the estimated depth map is affected by the parameters of the network. Therefore, before the deep neural network can be used in practice to obtain accurate estimated depth maps, it needs to be trained on training sample images so as to optimize its parameters.
Before training the deep neural network, the deep neural network needs to be constructed, and initial parameters are set. Since the construction of the deep neural network is a conventional technique, it will not be described in detail herein.
The data input when training the deep neural network includes the training sample images (IM(i), i=1, 2, …) and the training sample depth map (DMt(i), i=1, 2, …) corresponding to each training sample image. The training sample depth maps may be acquired by a depth camera or computed by other processing means. The training sample images and training sample depth maps form the training set.
Steps 101, 103 and 105 are performed for each training sample image in the training set.
At step 101, a depth map is estimated. Specifically, a corresponding estimated depth map DMe(i) is generated from the training sample image IM(i) using the deep neural network.
At step 103, the loss is calculated based on the planar region comparison. Specifically, a loss Lt(i) for the training sample image IM(i) is calculated based on the training sample depth map DMt(i) and the estimated depth map DMe(i), wherein the loss includes a loss term L(i) calculated based on a comparison of at least one planar region (Rs(i, j), j=1, 2, …) in the training sample depth map DMt(i) and the corresponding regions (Re(i, j), j=1, 2, …) in the estimated depth map DMe(i).
When assembling the training set, it may be ensured that each training sample image contains at least one planar region. If planar regions are detected before step 103 and none is found, the training sample image may be skipped.
The planar area may be a floor, ceiling, road surface, facade of a building, etc.
At step 105, the parameters are optimized. Specifically, parameters of the deep neural network are optimized based on the calculated loss Lt(i). In general, the parameters of a deep neural network are optimized by minimizing the loss function, which is well known to those skilled in the art and therefore not described in detail.
Parameters of the deep neural network may be optimized once for each batch of training sample images. The number of training sample images in a batch of training sample images is at least 1, preferably a plurality, e.g. 10 or 50. That is, the training set may be divided into a plurality of batches.
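The batching described above can be sketched as follows; the function and batch size are illustrative assumptions, and the actual per-batch parameter update depends on the network framework used.

```python
import numpy as np

def iterate_minibatches(images, depth_maps, batch_size=10, seed=0):
    """Yield shuffled (images, depth maps) batches; in the scheme above,
    the network parameters would be updated once per yielded batch."""
    idx = np.random.default_rng(seed).permutation(len(images))
    for start in range(0, len(idx), batch_size):
        sel = idx[start:start + batch_size]
        yield [images[k] for k in sel], [depth_maps[k] for k in sel]
```

With 25 training samples and a batch size of 10, this produces batches of 10, 10, and 5 images, each triggering one parameter update.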
The detection of planar areas is described below.
If planar regions are not pre-labeled for each training sample image, the method 100 further includes detecting planar regions from each training sample depth map (DMt(i), i=1, 2, …). The number of planar regions in a training sample depth map DMt(i) may be 1, 2, 3, or more. A number threshold may be set; when the number of planar regions detected in DMt(i) reaches this threshold, detection of further planar regions is stopped.
In a depth map, the depth over a planar region varies linearly, so the gradient of the depth is constant and the second-order gradient of the depth is zero. This property can be used to detect planar regions in the training sample depth map.
Note that the training sample depth map may contain distortion and errors, so that the actual depth gradient fluctuates around the constant overall gradient of the planar region, and the actual second-order gradient fluctuates around zero.
Note that the gradient of the depth may include a gradient in the x direction and a gradient in the y direction; the second-order gradient includes the derivative of the x-direction gradient with respect to x, the derivative of the x-direction gradient with respect to y, the derivative of the y-direction gradient with respect to x, and the derivative of the y-direction gradient with respect to y.
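Using the convention that x runs along image columns and y along rows, the first-order gradients and the four second-order gradients just listed can be computed with finite differences. A minimal numpy sketch (note that `numpy.gradient` returns per-axis derivatives, rows first):

```python
import numpy as np

def depth_gradients(dm):
    """Return (gx, gy) and the four second-order maps
    (dgx/dx, dgx/dy, dgy/dx, dgy/dy) of a depth map."""
    gy, gx = np.gradient(dm)      # axis 0 = y (rows), axis 1 = x (columns)
    gxy, gxx = np.gradient(gx)    # derivatives of the x-direction gradient
    gyy, gyx = np.gradient(gy)    # derivatives of the y-direction gradient
    return (gx, gy), (gxx, gxy, gyx, gyy)
```

For a perfectly planar depth map the first-order gradients are constant and all four second-order maps vanish, which is exactly the property the detection exploits.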
In one embodiment, planar regions are detected from the training sample depth map DMt(i) by calculating gradient values for a plurality of pixels of DMt(i) and determining the planar regions based on these gradient values, where each planar region satisfies: within the region, the percentage of pixels whose gradient value differs in absolute value from a constant indicating the overall gradient of the region by less than a first threshold is higher than a first predetermined percentage. The gradient value may be calculated for every pixel in the training sample depth map, or only for a subset of pixels according to some rule. The overall gradient of a planar region may be determined in a number of ways, for example: perform a preliminary clustering based on gradient variation to obtain a plurality of candidate planar regions; if, on a line segment of predetermined length within a candidate region, the gradient varies by less than a predetermined degree, compute the mean gradient over the pixels of that segment and take it as the overall gradient of the candidate region. Alternatively, build a gradient histogram of the training sample depth map DMt(i) (the abscissa is the gradient value and the ordinate is the percentage of pixels whose gradient falls within each bin) and take the gradient value at its maximum as the overall gradient of the candidate region. The planar regions are then screened from the candidate planar regions.
For example, suppose a candidate planar region has 10000 pixels and the first predetermined percentage is 90%. If 9010 pixels have an x-direction gradient whose absolute difference from the constant indicating the overall gradient (x direction) is below the first threshold, and 9020 pixels similarly satisfy the condition in the y direction, then the percentage of qualifying gradient values is (9010 + 9020) / 20000 = 90.15%, which exceeds the first predetermined percentage of 90%, so the candidate region is accepted as a planar region. The percentage is computed with the number of pixels for which the gradient was calculated as the denominator; for example, if the candidate region has 10000 pixels but gradients were calculated for only 5000 of them, the denominator is 5000. Alternatively, each planar region may instead be required to satisfy: within the region, the average of the absolute differences between the gradient values and the constant indicating the overall gradient is below a predetermined gradient deviation threshold. For example, if gradient values were calculated for 8000 pixels of a candidate region, the arithmetic mean of the absolute differences may be computed, and the candidate region is accepted as a planar region if this mean is below the threshold.
When calculating the average value, the calculation may be performed for all pixels in the candidate plane area, or may be performed for some pixels in the candidate plane area. The average value may be: arithmetic mean, geometric mean, or root mean square mean, etc.
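The percentage criterion of the worked example above can be sketched as follows; the x and y directions are counted separately and pooled, matching the 9010/9020 example, and the threshold values here are illustrative assumptions.

```python
import numpy as np

def passes_gradient_test(gx, gy, mask, overall_gx, overall_gy,
                         first_threshold=0.05, first_pct=0.90):
    """Accept a candidate region if the pooled fraction of per-direction
    gradient values within first_threshold of the overall gradient
    exceeds the first predetermined percentage."""
    ok_x = np.abs(gx[mask] - overall_gx) < first_threshold
    ok_y = np.abs(gy[mask] - overall_gy) < first_threshold
    return (ok_x.sum() + ok_y.sum()) / (2.0 * mask.sum()) > first_pct
```

A perfectly planar region passes with a pooled fraction of 100%; shifting every x-gradient far from the overall value drops the pooled fraction to 50% and the region is rejected.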
In one embodiment, planar regions are detected from the training sample depth map by calculating second-order gradient values for a plurality of pixels and determining the planar regions based on them, where each planar region satisfies: within the region, the percentage of pixels whose second-order gradient value has an absolute value below a second threshold is higher than a second predetermined percentage. The second-order gradient may be calculated for every pixel, or only for a subset of pixels according to some rule; in this embodiment, the percentage is computed with the number of pixels for which the second-order gradient was calculated as the denominator. Alternatively, each planar region may instead be required to satisfy: within the region, the average of the absolute values of the second-order gradient values is below a predetermined second-order gradient threshold. For example, if second-order gradient values were calculated for 8000 pixels of a candidate region, the arithmetic mean of their absolute values may be computed, and the candidate region is accepted as a planar region if this mean is below the threshold. The average may be taken over all pixels of the candidate region or over only some of them.
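The alternative mean-absolute second-order criterion can be sketched as below; the threshold is an illustrative assumption and should be tuned to the depth units of the data, and the pixel subset is taken to be the whole candidate region.

```python
import numpy as np

def passes_second_order_test(dm, mask, second_order_threshold=0.01):
    """Accept a candidate region if the mean absolute second-order
    gradient of the depth inside the region is below the threshold."""
    gy, gx = np.gradient(dm)
    gxy, gxx = np.gradient(gx)
    gyy, gyx = np.gradient(gy)
    second = np.stack([gxx, gxy, gyx, gyy])   # the four second-order maps
    return float(np.mean(np.abs(second[:, mask]))) < second_order_threshold
```

A linear (planar) depth patch passes; a curved patch such as z = x² fails, since its second-order gradient is far from zero.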
In one embodiment, planar regions are detected from the training sample depth map by: calculating gradient values for a plurality of pixels of the training sample depth map, clustering based on the gradient values and positions of these pixels to obtain at least one cluster, and extracting a connected component of the image region corresponding to each cluster as one planar region. The gradient value may be calculated for every pixel, or only for a subset of pixels according to some rule. In addition to the depth gradient of each pixel, positional information is introduced into the clustering to ensure that pixels in the same cluster are not far from one another. That is, during clustering each pixel p is described by the four-dimensional feature vector [∇_x D(p), ∇_y D(p), p_x, p_y]^T, where ∇_x D(p) and ∇_y D(p) are the gradients of the depth at pixel p in the horizontal and vertical directions, respectively, and (p_x, p_y) is the position of pixel p in the image. The pixels of the training sample depth map may be clustered by, but not limited to, hierarchical clustering. In hierarchical clustering, the minimum distance between any pixel of cluster A and any pixel of cluster B may be taken as the distance between clusters A and B; if this distance is below a preset threshold, A and B are merged into one cluster.
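The four-dimensional clustering can be sketched with SciPy's single-linkage hierarchical clustering, which implements exactly the minimum-distance merge rule just described. The weight balancing the gradient features against the position features, and the merge threshold, are assumptions for illustration; this brute-force formulation is only practical for small maps.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_depth_pixels(dm, merge_threshold=1.5, grad_weight=10.0):
    """Cluster pixels on [grad_x, grad_y, p_x, p_y] features and return a
    per-pixel cluster label map."""
    gy, gx = np.gradient(dm)
    ys, xs = np.mgrid[0:dm.shape[0], 0:dm.shape[1]]
    feats = np.stack([grad_weight * gx.ravel(),
                      grad_weight * gy.ravel(),
                      xs.ravel().astype(float),
                      ys.ravel().astype(float)], axis=1)
    Z = linkage(feats, method='single')   # single linkage = min-distance merges
    return fcluster(Z, t=merge_threshold, criterion='distance').reshape(dm.shape)
```

On a depth map made of two flat plateaus at different depths, the high-gradient boundary pixels keep the two plateaus in separate clusters, while each plateau's pixels chain together through unit-distance neighbors.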
To facilitate the subsequent loss term calculated based on a comparison of at least one planar region in the training sample depth map and a corresponding region in the estimated depth map, the detected planar regions may be post-processed (a region before processing may be referred to as a "candidate planar region"). For example: filling holes in the planar region; prohibiting the use of a candidate planar region whose number of pixels is below a third threshold; or applying an erosion operation to the candidate planar region (e.g., removing an annular band along its peripheral edge). One or more of these post-processing operations may be applied to the candidate planar regions.
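The three post-processing operations (hole filling, size screening, erosion of the edge ring) map directly onto standard binary-morphology routines; the minimum-size threshold and erosion depth below are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def postprocess_region(mask, min_pixels=50, erode_iters=1):
    """Fill holes in a candidate region mask, erode its peripheral edge,
    and reject it (return None) if too few pixels remain."""
    mask = ndimage.binary_fill_holes(mask)
    mask = ndimage.binary_erosion(mask, iterations=erode_iters)
    if int(mask.sum()) < min_pixels:
        return None   # third threshold: region prohibited from use
    return mask
```

A large square mask with an interior hole survives with the hole filled and its edge ring removed; a tiny 2x2 blob is eroded away and rejected.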
After determining at least one planar region Rs (i, j) of the training depth map DMt (i), a local loss can be calculated for that planar region. The calculation of the loss is described below.
In the prior art, the loss function is typically computed from the absolute difference between the estimated depth of each pixel of the estimated depth map and the depth of the corresponding pixel of the training depth map. The present disclosure additionally includes, weighted by a preset combination coefficient, a loss term calculated based on a comparison of the planar regions in the training sample depth map and the corresponding regions in the estimated depth map.
The calculated loss term L (i) includes the local loss L (i, j) for the planar region Rs (i, j). The loss term L (i) may be an accumulation of local loss L (i, j) with respect to j.
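This combination can be sketched as follows: the total loss for one training image adds the accumulated local plane losses, scaled by the preset combination coefficient, to the conventional per-pixel term. `local_loss` stands for any of the local-loss variants described in this section, and `lam` is an illustrative value.

```python
import numpy as np

def total_loss(dm_true, dm_est, plane_masks, local_loss, lam=0.1):
    """Lt(i) = per-pixel absolute difference + lam * sum_j L(i, j)."""
    pixel_term = float(np.mean(np.abs(dm_est - dm_true)))
    plane_term = sum(local_loss(dm_true, dm_est, m) for m in plane_masks)
    return pixel_term + lam * plane_term
```

Any callable with the signature `(dm_true, dm_est, mask) -> float` can be plugged in as `local_loss`, which keeps the choice of plane comparison independent of the overall loss structure.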
In one embodiment, the calculated loss term includes a local loss L (i, j) for the planar region Rs (i, j) determined by: determining a plane parameter of a corresponding plane corresponding to the plane region Rs (i, j) in the training sample depth map DMt (i), and calculating a sum or average value of distances between a first predetermined number of three-dimensional points and the corresponding plane in the corresponding region Re (i, j) corresponding to the plane region Rs (i, j) in the estimated depth map DMe (i) based on the plane parameter as a local loss L (i, j). If an average is used as the local loss, the first predetermined number may be different for different training sample images.
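A minimal sketch of this point-to-plane variant follows. Pixel coordinates stand in for back-projected camera-space points, so this is an assumption-laden simplification: a real implementation would apply the camera intrinsics before fitting.

```python
import numpy as np

def plane_distance_loss(dm_true, dm_est, mask):
    """Local loss L(i, j): mean distance from the estimated region's 3-D
    points to the plane z = a*x + b*y + c fitted on the ground-truth region."""
    ys, xs = np.nonzero(mask)
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(float)
    coef, *_ = np.linalg.lstsq(A, dm_true[mask], rcond=None)  # (a, b, c)
    a, b, _ = coef
    # point-to-plane distance |a*x + b*y + c - z| / sqrt(a^2 + b^2 + 1)
    dist = np.abs(A @ coef - dm_est[mask]) / np.sqrt(a * a + b * b + 1.0)
    return float(dist.mean())
```

When the estimated region coincides with the fitted plane the loss vanishes; a uniform depth offset of the estimate produces a proportional nonzero loss.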
In one embodiment, the calculated loss term includes a local loss L (i, j) for the planar region Rs (i, j) determined by: determining the normal of the corresponding plane corresponding to the planar region Rs (i, j) in the training sample depth map DMt (i), and calculating the sum or average of the absolute values of the inner products of the second predetermined number of vectors and the normal of the corresponding plane in the corresponding region Re (i, j) corresponding to the planar region in the estimated depth map DMe (i) as the local loss. The length of the vector is preferably uniform. If not uniform, the absolute values may be normalized by length and then summed or averaged. The length of the normal is preferably unit length and if not unit length, the absolute value needs to be normalized by the normal length before summing or averaging. The vector may be determined by: in the corresponding region Re (i, j), two three-dimensional points calculated based on the estimated depth are randomly selected, and one three-dimensional point is used as a start point of the vector and the other three-dimensional point is used as an end point of the vector.
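The normal-based variant can be sketched as below. Vectors are normalized to unit length, as the text suggests for non-uniform lengths; as before, pixel coordinates stand in for camera-space X and Y, which is an illustrative simplification.

```python
import numpy as np

def normal_alignment_loss(dm_true, dm_est, mask, n_vectors=200, seed=0):
    """Local loss L(i, j): mean |<v, n>| over vectors v drawn between random
    3-D points of the estimated region, with n the unit normal of the plane
    fitted on the ground-truth region."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1).astype(float)
    (a, b, _), *_ = np.linalg.lstsq(A, dm_true[mask], rcond=None)
    n = np.array([-a, -b, 1.0]) / np.sqrt(a * a + b * b + 1.0)  # unit normal
    pts = np.stack([xs.astype(float), ys.astype(float), dm_est[mask]], axis=1)
    i = rng.integers(0, len(pts), n_vectors)
    j = rng.integers(0, len(pts), n_vectors)
    v = pts[i] - pts[j]
    lens = np.linalg.norm(v, axis=1)
    keep = lens > 0              # drop degenerate zero-length vectors
    return float(np.mean(np.abs(v[keep] @ n) / lens[keep]))
```

If the estimated region lies exactly in the ground-truth plane, every sampled vector is perpendicular to the normal and the loss is (numerically) zero; a curved estimate yields a clearly positive loss.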
In one embodiment, the calculated loss term includes a local loss L (i, j) for the planar region Rs (i, j) determined by: the sum or average of the absolute values of the second order gradients of the third predetermined number of pixels in the corresponding region Re (i, j) corresponding to the planar region Rs (i, j) in the estimated depth map DMe (i) is calculated as the local loss L (i, j).
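The third variant penalizes curvature of the estimated depth directly. A minimal numpy sketch, taking the pixel subset to be the whole region:

```python
import numpy as np

def second_order_smoothness_loss(dm_est, mask):
    """Local loss L(i, j): mean absolute second-order gradient of the
    estimated depth inside the region. This variant needs only the
    region location from the training depth map, not a fitted plane."""
    gy, gx = np.gradient(dm_est)
    gxy, gxx = np.gradient(gx)
    gyy, gyx = np.gradient(gy)
    return float(np.mean(np.abs(np.stack([gxx, gxy, gyx, gyy])[:, mask])))
```

A design note: because no ground-truth plane parameters enter this term, it only pushes the estimate toward being *some* plane over the region, relying on the per-pixel term to anchor the absolute depths.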
Fig. 2 is an exemplary flowchart of a method 200 for training a deep neural network according to another embodiment of the present disclosure. The method 200 is described below with reference to fig. 2.
After the deep neural network is constructed and a training set is obtained that includes the training sample images IM(i) and the training sample depth maps DMt(i) (i=1, 2, …), execution of the method 200 may begin.
For each training sample image in the training set, steps 201, 203, 205, 207 are performed.
At step 201, a depth map is estimated. Specifically, a corresponding estimated depth map DMe(i) is generated from the training sample image IM(i) using the deep neural network.
At step 203, planar regions are detected. Specifically, planar regions Rs(i, j) are detected from the training sample depth map DMt(i) by: calculating gradient values for a plurality of pixels of the training sample depth map, clustering based on the gradient values and positions of these pixels to obtain at least one cluster, and extracting the connected components of the image region corresponding to each cluster as planar regions. The number of planar regions is at least 1. Each planar region satisfies: within the region, the average of the absolute values of the second-order gradient values is below a predetermined second-order gradient threshold. If a region obtained by clustering does not meet this average second-order gradient requirement, it is discarded, i.e., it is not used in the comparison-based loss calculation that follows. Hierarchical clustering may be employed for the clustering step. The method 200 may also include the post-processing described previously.
At step 205, the loss is calculated based on the planar region comparison. Specifically, a loss Lt(i) for the training sample image IM(i) is calculated based on the training sample depth map DMt(i) and the estimated depth map DMe(i), wherein the loss includes a loss term L(i) calculated based on a comparison of at least one planar region (Rs(i, j), j=1, 2, …) in DMt(i) and the corresponding regions (Re(i, j), j=1, 2, …) in DMe(i).
At step 207, the parameters are optimized. Specifically, parameters of the deep neural network are optimized based on the calculated loss Lt(i). Typically, the parameters are optimized by minimizing the loss function.
Parameters of the deep neural network may be optimized once for a batch of training sample images.
Steps 201 and 203 may be performed in parallel or sequentially. Step 201 may be advanced, or step 203 may be advanced.
A method for determining a depth map using the trained deep neural network of the present disclosure is described below.
Fig. 3 is an exemplary flowchart of a method 300 for determining a depth map according to one embodiment of the present disclosure. In step 301, a deep neural network is trained. Specifically, the training takes into account a loss term calculated based on a comparison of at least one planar region in the training sample depth map and a corresponding region in the estimated depth map; the training method may be, for example, the aforementioned method 100 or 200. In step 303, a depth map is determined from a single input image. Specifically, the trained deep neural network determines a corresponding depth map from the single image; the method 300 is thus able to determine a depth map from a single image, and the accuracy of the depth map is improved because the additional constraint that planar regions impose on the loss was taken into account during training.
The apparatus for training a deep neural network of the present invention is described below.
Fig. 4 is an exemplary block diagram of an apparatus 400 for training a deep neural network according to one embodiment of the present disclosure. The apparatus 400 comprises: a depth map estimation unit 401, a loss calculation unit 403, and a parameter optimization unit 405. The depth map estimation unit 401 is configured to generate a corresponding estimated depth map from a training sample image using the deep neural network. The loss calculation unit 403 is configured to calculate a loss of the training sample image based on the training sample depth map and the estimated depth map of the training sample image. The parameter optimization unit 405 is configured to optimize parameters of the deep neural network based on the calculated loss. The loss comprises a loss term calculated based on a comparison of at least one planar region in the training sample depth map and a corresponding region in the estimated depth map. Referring to the method 100, the apparatus 400 may further comprise a planar region detection unit configured to detect at least one planar region from the training sample depth map. The planar region detection unit may be configured to detect the at least one planar region from the training sample depth map in a number of ways, for example by calculating gradient values of a plurality of pixels of the training sample depth map, clustering based on the gradient values and positions of the plurality of pixels to obtain at least one cluster, and extracting a connected domain in the image area corresponding to each cluster as one planar region of the at least one planar region.
In one embodiment, the present disclosure also provides a storage medium. The storage medium has stored thereon program code readable by an information processing device, which when executed on the information processing device causes the information processing device to perform the above-described method according to the present invention (including a method for training a deep neural network and a method for determining a depth map). Storage media include, but are not limited to, floppy diskettes, compact discs, magneto-optical discs, memory cards, memory sticks, and the like.
Fig. 5 is an exemplary block diagram of an information processing apparatus 500 according to one embodiment of the present disclosure.
In fig. 5, a Central Processing Unit (CPU) 501 performs various processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 to a Random Access Memory (RAM) 503. The RAM 503 also stores data and the like necessary when the CPU 501 executes various processes, as necessary.
The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output interface 505 is also connected to the bus 504.
The following components are connected to the input/output interface 505: an input section 506 including a soft keyboard or the like; an output portion 507 including a display such as a Liquid Crystal Display (LCD), a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet or a local area network.
The drive 510 is also connected to the input/output interface 505 as needed. A removable medium 511 such as a semiconductor memory or the like is installed on the drive 510 as needed, so that a computer program read therefrom is installed to the storage section 508 as needed.
The CPU 501 may run the program code of the aforementioned method for determining a depth map or the program code of the method for training a deep neural network.
The method for training the deep neural network, the method for determining a depth map, and the corresponding apparatus of the present invention have at least the following beneficial effect: the trained deep neural network obtained by the method or apparatus improves the accuracy of the estimated depth map when only a single input image is used.
While the invention has been disclosed in the context of specific embodiments thereof, it will be appreciated that those skilled in the art may devise various modifications, including combinations and substitutions of features between embodiments, as appropriate, within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents are intended to be included within the scope of this invention.
It should be emphasized that the term "comprises/comprising" when used herein is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
Furthermore, the methods of the embodiments of the present invention are not limited to being performed in the temporal order described in the specification or shown in the drawings, but may be performed in other temporal orders, in parallel, or independently. Therefore, the order of execution of the methods described in the present specification does not limit the technical scope of the present invention.
Supplementary Notes
1. A method for training a deep neural network, the method comprising the steps of:
for each training sample image in the training set,
generating a corresponding estimated depth map according to the training sample image by using the depth neural network;
calculating a loss of the training sample image based on a training sample depth map of the training sample image and the estimated depth map; and
optimizing parameters of the deep neural network based on the calculated loss;
wherein the loss comprises a loss term calculated based on a comparison of at least one planar region in the training sample depth map and a corresponding region in the estimated depth map.
2. The method of supplementary note 1, further comprising detecting the at least one planar region from the training sample depth map.
3. The method of supplementary note 2, wherein the at least one planar region is detected from the training sample depth map by: calculating gradient values of a plurality of pixels of the training sample depth map, and determining the at least one planar region based on the gradient values of the plurality of pixels;
wherein each of the at least one planar region satisfies: in the planar region, a percentage ratio of the number of pixels whose absolute value of the difference of the gradient value and the constant indicating the overall gradient of the planar region is lower than a first threshold value is higher than a first predetermined percentage, or an average value of the absolute value of the difference of the gradient value and the constant indicating the overall gradient of the planar region is lower than a predetermined gradient deviation threshold value.
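A hedged sketch of the criterion of supplementary note 3; the use of the median as the "constant indicating the overall gradient" and both threshold values are illustrative assumptions:

```python
import numpy as np

def is_planar_by_gradient(grad_region, first_thresh=0.02, min_fraction=0.9):
    """Check the supplementary-note-3 planarity condition on one region.

    grad_region: 1-D array of gradient values of the region's pixels.
    """
    c = np.median(grad_region)      # constant indicating the overall gradient
    dev = np.abs(grad_region - c)
    # Either enough pixels are close to the constant, or the average
    # deviation is below the gradient deviation threshold.
    return (dev < first_thresh).mean() > min_fraction or dev.mean() < first_thresh
```

A region with a uniform gradient passes the check, while a region whose gradient varies widely does not.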
4. The method of supplementary note 2, wherein the at least one planar region is detected from the training sample depth map by: calculating second order gradient values of a plurality of pixels of the training sample depth map, and determining the at least one planar region based on the second order gradient values of the plurality of pixels;
wherein each of the at least one planar region satisfies: in the planar region, the percentage of the number of pixels whose absolute value of the second order gradient value is smaller than the second threshold value is higher than a second predetermined percentage, or the average value of the absolute values of the second order gradient values is lower than a predetermined second order gradient threshold value.
5. The method of supplementary note 2, wherein the at least one planar region is detected from the training sample depth map by: calculating gradient values of a plurality of pixels of the training sample depth map, clustering based on the gradient values and positions of the plurality of pixels to obtain at least one cluster, and extracting a connected domain in the image area corresponding to each cluster as one planar region of the at least one planar region.
6. The method according to one of supplementary notes 3 to 5, wherein the calculated loss term comprises a local loss for one of the at least one planar region, determined by: determining plane parameters of the corresponding plane for the planar region in the training sample depth map, and, based on the plane parameters, calculating the sum or average of the distances from a first predetermined number of three-dimensional points in the corresponding region of the estimated depth map to the corresponding plane as the local loss.
7. The method according to one of supplementary notes 3 to 5, wherein the calculated loss term comprises a local loss for one of the at least one planar region, determined by: determining the normal of the corresponding plane for the planar region in the training sample depth map, and calculating the sum or average of the absolute values of the inner products between a second predetermined number of vectors in the corresponding region of the estimated depth map and the normal of the corresponding plane as the local loss.
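A sketch of the variant of supplementary note 7, assuming the plane normal has been determined from the training sample depth map and that the "second predetermined number of vectors" are taken between consecutive 3D points of the estimated region (an illustrative pairing, not mandated by the text):

```python
import numpy as np

def normal_inner_product_loss(normal, pts_est):
    """Mean |<v, n>| over vectors v between points of the estimated region."""
    normal = normal / np.linalg.norm(normal)
    vecs = np.diff(pts_est, axis=0)    # vectors between consecutive 3D points
    # Vectors lying in the plane are perpendicular to the normal, so each
    # inner product measures deviation from the plane.
    return np.abs(vecs @ normal).mean()
```

Points lying in the plane give a loss of zero, since every in-plane vector is perpendicular to the normal.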
8. The method according to one of supplementary notes 3 to 5, wherein the calculated loss term comprises a local loss for one of the at least one planar region, determined by: calculating the sum or average of the absolute values of the second-order gradients of a third predetermined number of pixels in the corresponding region of the estimated depth map as the local loss.
9. The method of one of supplementary notes 3 to 5, wherein determining at least one planar region in the estimated depth map further comprises: filling a void in one of the at least one planar region.
10. The method according to one of supplementary notes 3 to 5, wherein in determining at least one planar area in the estimated depth map, if the number of pixels in a candidate planar area of the at least one planar area is smaller than a third threshold value, the candidate planar area is prohibited from being used as one of the at least one planar area.
11. The method according to one of supplementary notes 3 to 5, wherein, when determining at least one planar region in the estimated depth map, a candidate planar region of the at least one planar region is eroded, and the eroded candidate planar region is taken as one of the at least one planar region.
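The three post-processing steps of supplementary notes 9 to 11 map naturally onto standard binary morphology; the ordering and threshold values below are illustrative assumptions:

```python
import numpy as np
from scipy import ndimage

def post_process_region(mask, min_pixels=30, erosion_iters=1):
    """Post-process one candidate planar region (a boolean mask)."""
    mask = ndimage.binary_fill_holes(mask)     # note 9: fill voids
    if mask.sum() < min_pixels:                # note 10: too few pixels
        return None                            # prohibit use of the candidate
    # Note 11: erode to keep the region away from unreliable boundaries.
    return ndimage.binary_erosion(mask, iterations=erosion_iters)
```

For an 8x8 square candidate with a one-pixel hole, the hole is filled and one erosion pass shrinks the region to 6x6; a candidate below the pixel threshold is discarded.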
12. The method of supplementary note 3, wherein the gradient values include a horizontal gradient value and a vertical gradient value.
13. The method of supplementary note 5, wherein the at least one cluster is obtained by performing hierarchical clustering.
14. The method of supplementary note 1, wherein the loss further comprises a sum or average of the absolute values of the depth differences between pixels of the estimated depth map of the training sample image and pixels of the training sample depth map.
15. A method for training a deep neural network, the method comprising the steps of:
for each training sample image in the training set,
generating a corresponding estimated depth map according to the training sample image by using the depth neural network;
detecting at least one planar region from a training sample depth map of the training sample image;
calculating a loss of the training sample image based on a training sample depth map of the training sample image and the estimated depth map; and
optimizing parameters of the deep neural network based on the calculated loss;
wherein the loss comprises a loss term calculated based on a comparison of the at least one planar region in the training sample depth map and a corresponding region in the estimated depth map;
wherein the at least one planar region is detected from the training sample depth map by: calculating gradient values of a plurality of pixels of the training sample depth map, clustering based on the gradient values and positions of the plurality of pixels to obtain at least one cluster, and extracting a connected domain in the image area corresponding to each cluster as one planar region of the at least one planar region; and
wherein each of the at least one planar region satisfies: in the planar region, the average value of the absolute values of the second-order gradient values is below a predetermined second-order gradient threshold.
16. The method of supplementary note 15, further comprising at least one of the following post-processing operations:
filling a void in one of the at least one planar region;
in determining at least one planar region in the estimated depth map, if the number of pixels in a candidate planar region of the at least one planar region is less than a third threshold, prohibiting the use of the candidate planar region as one planar region of the at least one planar region; and
eroding a candidate planar region of the at least one planar region when determining the at least one planar region in the estimated depth map, and taking the eroded candidate planar region as one planar region of the at least one planar region.
17. The method of supplementary note 15, wherein the at least one cluster is obtained by hierarchical clustering.
18. An apparatus for training a deep neural network, comprising:
a depth map estimation unit configured to generate a corresponding estimated depth map from a training sample image using the deep neural network;
a loss calculation unit configured to calculate a loss of the training sample image based on a training sample depth map of the training sample image and the estimated depth map; and
a parameter optimization unit configured to optimize parameters of the deep neural network based on the calculated loss;
wherein the loss comprises a loss term calculated based on a comparison of at least one planar region in the training sample depth map and a corresponding region in the estimated depth map.
19. The apparatus of supplementary note 18, further comprising a planar region detection unit configured to detect the at least one planar region from the training sample depth map.
20. The apparatus of supplementary note 19, wherein the planar region detection unit is further configured to detect the at least one planar region from the training sample depth map by: calculating gradient values of a plurality of pixels of the training sample depth map, clustering based on the gradient values and positions of the plurality of pixels to obtain at least one cluster, and extracting a connected domain in the image area corresponding to each cluster as one planar region of the at least one planar region.

Claims (10)

1. A method for training a deep neural network for predicting depth maps, the method comprising the steps of:
for each training sample image in the training set,
generating a corresponding estimated depth map according to the training sample image by using the depth neural network;
calculating a loss of the training sample image based on a training sample depth map of the training sample image and the estimated depth map; and
optimizing parameters of the deep neural network based on the calculated loss;
wherein the loss comprises a loss term calculated based on a comparison of at least one planar region in the training sample depth map and a corresponding region in the estimated depth map; and
additional constraints related to the at least one planar region are considered in determining the loss term.
2. The method of claim 1, further comprising detecting the at least one planar region from the training sample depth map by: calculating gradient values of a plurality of pixels of the training sample depth map, and determining the at least one planar region based on the gradient values of the plurality of pixels;
wherein each of the at least one planar region satisfies: in the planar region, a percentage ratio of the number of pixels whose absolute value of the difference of the gradient value and the constant indicating the overall gradient of the planar region is lower than a first threshold value is higher than a first predetermined percentage, or an average value of the absolute value of the difference of the gradient value and the constant indicating the overall gradient of the planar region is lower than a predetermined gradient deviation threshold value.
3. The method of claim 1, further comprising detecting the at least one planar region from the training sample depth map by: calculating second order gradient values of a plurality of pixels of the training sample depth map, and determining the at least one planar region based on the second order gradient values of the plurality of pixels;
wherein each of the at least one planar region satisfies: in the planar region, the percentage of the number of pixels whose absolute value of the second order gradient value is smaller than the second threshold value is higher than a second predetermined percentage, or the average value of the absolute values of the second order gradient values is lower than a predetermined second order gradient threshold value.
4. The method of claim 1, further comprising detecting the at least one planar region from the training sample depth map by: calculating gradient values of a plurality of pixels of the training sample depth map, clustering based on the gradient values and positions of the plurality of pixels to obtain at least one cluster, and extracting a connected domain in the image area corresponding to each cluster as one planar region of the at least one planar region.
5. The method of one of claims 2 to 4, wherein the calculated loss term comprises a local loss for one of the at least one planar region, determined by: determining plane parameters of the corresponding plane for the planar region in the training sample depth map, and, based on the plane parameters, calculating the sum or average of the distances from a first predetermined number of three-dimensional points in the corresponding region of the estimated depth map to the corresponding plane as the local loss.
6. The method of one of claims 2 to 4, wherein the calculated loss term comprises a local loss for one of the at least one planar region, determined by: determining the normal of the corresponding plane for the planar region in the training sample depth map, and calculating the sum or average of the absolute values of the inner products between a second predetermined number of vectors in the corresponding region of the estimated depth map and the normal of the corresponding plane as the local loss.
7. The method of one of claims 2 to 4, wherein the calculated loss term comprises a local loss for one of the at least one planar region, determined by: calculating the sum or average of the absolute values of the second-order gradients of a third predetermined number of pixels in the corresponding region of the estimated depth map as the local loss.
8. A method for training a deep neural network for predicting depth maps, the method comprising the steps of:
for each training sample image in the training set,
generating a corresponding estimated depth map according to the training sample image by using the depth neural network;
detecting at least one planar region from a training sample depth map of the training sample image;
calculating a loss of the training sample image based on a training sample depth map of the training sample image and the estimated depth map; and
optimizing parameters of the deep neural network based on the calculated loss;
wherein the loss comprises a loss term calculated based on a comparison of the at least one planar region in the training sample depth map and a corresponding region in the estimated depth map;
wherein the at least one planar region is detected from the training sample depth map by: calculating gradient values of a plurality of pixels of the training sample depth map, clustering based on the gradient values and positions of the plurality of pixels to obtain at least one cluster, and extracting a connected domain in the image area corresponding to each cluster as one planar region of the at least one planar region;
wherein each of the at least one planar region satisfies: in the planar region, the average value of the absolute values of the second-order gradient values is below a predetermined second-order gradient threshold; and
additional constraints related to the at least one planar region are considered in determining the loss term.
9. The method of claim 8, wherein the at least one cluster is obtained by performing hierarchical clustering.
10. An apparatus for training a deep neural network for predicting depth maps, comprising:
a depth map estimation unit configured to generate a corresponding estimated depth map from a training sample image using the deep neural network;
a loss calculation unit configured to calculate a loss of the training sample image based on a training sample depth map of the training sample image and the estimated depth map; and
a parameter optimization unit configured to optimize parameters of the deep neural network based on the calculated loss;
wherein the loss comprises a loss term calculated based on a comparison of at least one planar region in the training sample depth map and a corresponding region in the estimated depth map; and
additional constraints related to the at least one planar region are considered in determining the loss term.
CN201810844262.4A 2018-07-27 2018-07-27 Method and apparatus for training deep neural networks Active CN110766152B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810844262.4A CN110766152B (en) 2018-07-27 2018-07-27 Method and apparatus for training deep neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810844262.4A CN110766152B (en) 2018-07-27 2018-07-27 Method and apparatus for training deep neural networks

Publications (2)

Publication Number Publication Date
CN110766152A CN110766152A (en) 2020-02-07
CN110766152B true CN110766152B (en) 2023-08-04

Family

ID=69327293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810844262.4A Active CN110766152B (en) 2018-07-27 2018-07-27 Method and apparatus for training deep neural networks

Country Status (1)

Country Link
CN (1) CN110766152B (en)

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8781882B1 (en) * 2008-08-07 2014-07-15 Accenture Global Services Limited Automotive industry high performance capability assessment
CN105488515A (en) * 2014-09-17 2016-04-13 富士通株式会社 Method for training convolutional neural network classifier and image processing device
CN106096538A (en) * 2016-06-08 2016-11-09 中国科学院自动化研究所 Face identification method based on sequencing neural network model and device
CN106599797A (en) * 2016-11-24 2017-04-26 北京航空航天大学 Infrared face identification method based on local parallel nerve network
CN106897390A (en) * 2017-01-24 2017-06-27 北京大学 Target precise search method based on depth measure study
WO2017114810A1 (en) * 2015-12-31 2017-07-06 Vito Nv Methods, controllers and systems for the control of distribution systems using a neural network arhcitecture
CN107133616A (en) * 2017-04-02 2017-09-05 南京汇川图像视觉技术有限公司 A kind of non-division character locating and recognition methods based on deep learning
CN107208478A (en) * 2015-02-20 2017-09-26 哈里伯顿能源服务公司 The classification of grain graininess and distribution of shapes in drilling fluid
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN107229918A (en) * 2017-05-26 2017-10-03 西安电子科技大学 A kind of SAR image object detection method based on full convolutional neural networks
CN107240119A (en) * 2017-04-19 2017-10-10 北京航空航天大学 Utilize the method for improving the fuzzy clustering algorithm extraction uneven infrared pedestrian of gray scale
CN107403415A (en) * 2017-07-21 2017-11-28 深圳大学 Compression depth plot quality Enhancement Method and device based on full convolutional neural networks
CN107480726A (en) * 2017-08-25 2017-12-15 电子科技大学 A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon
WO2018000752A1 (en) * 2016-06-27 2018-01-04 浙江工商大学 Monocular image depth estimation method based on multi-scale cnn and continuous crf
CN107610137A (en) * 2017-09-27 2018-01-19 武汉大学 A kind of high-resolution remote sensing image optimal cut part method
CN107767413A (en) * 2017-09-20 2018-03-06 华南理工大学 A kind of image depth estimation method based on convolutional neural networks
WO2018121690A1 (en) * 2016-12-29 2018-07-05 北京市商汤科技开发有限公司 Object attribute detection method and device, neural network training method and device, and regional detection method and device
CN108304859A (en) * 2017-12-29 2018-07-20 达闼科技(北京)有限公司 Image-recognizing method and cloud system
US10032256B1 (en) * 2016-11-18 2018-07-24 The Florida State University Research Foundation, Inc. System and method for image processing using automatically estimated tuning parameters

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9363498B2 (en) * 2011-11-11 2016-06-07 Texas Instruments Incorporated Method, system and computer program product for adjusting a convergence plane of a stereoscopic image
US9483728B2 (en) * 2013-12-06 2016-11-01 International Business Machines Corporation Systems and methods for combining stochastic average gradient and hessian-free optimization for sequence training of deep neural networks
EP3234871B1 (en) * 2014-12-17 2020-11-25 Google LLC Generating numeric embeddings of images
US10410118B2 (en) * 2015-03-13 2019-09-10 Deep Genomics Incorporated System and method for training neural networks
US10037592B2 (en) * 2015-06-05 2018-07-31 Mindaptiv LLC Digital quaternion logarithm signal processing system and method for images and other data types
US10572800B2 (en) * 2016-02-05 2020-02-25 Nec Corporation Accelerating deep neural network training with inconsistent stochastic gradient descent
CN106295678B (en) * 2016-07-27 2020-03-06 北京旷视科技有限公司 Neural network training and constructing method and device and target detection method and device
US11068781B2 (en) * 2016-10-07 2021-07-20 Nvidia Corporation Temporal ensembling for semi-supervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Binocular stereo vision matching algorithm based on a deep convolutional neural network; Xiao Jinsheng; Tian Hong; Zou Wentao; Tong Le; Lei Junfeng; Acta Optica Sinica (Issue 08); full text *

Also Published As

Publication number Publication date
CN110766152A (en) 2020-02-07

Similar Documents

Publication Publication Date Title
Hoang et al. Metaheuristic optimized edge detection for recognition of concrete wall cracks: a comparative study on the performances of roberts, prewitt, canny, and sobel algorithms
CN109447154B (en) Picture similarity detection method, device, medium and electronic equipment
CN114418957A (en) Global and local binary pattern image crack segmentation method based on robot vision
US20170270664A1 (en) Methods for characterizing features of interest in digital images and systems for practicing same
CN108230292B (en) Object detection method, neural network training method, device and electronic equipment
US20070173744A1 (en) System and method for detecting intervertebral disc alignment using vertebrae segmentation
CN103679743A (en) Target tracking device and method as well as camera
Yang et al. An accurate mura defect vision inspection method using outlier-prejudging-based image background construction and region-gradient-based level set
CN111444807B (en) Target detection method, device, electronic equipment and computer readable medium
US10332244B2 (en) Methods and apparatuses for estimating an ambiguity of an image
Liang et al. An algorithm for concrete crack extraction and identification based on machine vision
Chiverton et al. Automatic bootstrapping and tracking of object contours
CN116740072B (en) Road surface defect detection method and system based on machine vision
CN117392464B (en) Image anomaly detection method and system based on multi-scale denoising probability model
CN112102202A (en) Image segmentation method and image processing device
JP6260113B2 (en) Edge extraction method and equipment
CN107808165B (en) Infrared image matching method based on SUSAN corner detection
CN110245600A (en) Adaptively originate quick stroke width unmanned plane Approach for road detection
CN115457044A (en) Pavement crack segmentation method based on class activation mapping
KR102330263B1 (en) Method and apparatus for detecting nuclear region using artificial neural network
Kumar et al. Histogram thresholding in image segmentation: a joint level set method and lattice boltzmann method based approach
Adu-Gyamfi et al. Functional evaluation of pavement condition using a complete vision system
CN110766152B (en) Method and apparatus for training deep neural networks
CN103235950A (en) Target detection image processing method
Wang et al. Comparison and Analysis of Several Clustering Algorithms for Pavement Crack Segmentation Guided by Computational Intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant