
CN109472820B - Monocular RGB-D camera real-time face reconstruction method and device - Google Patents

Monocular RGB-D camera real-time face reconstruction method and device

Info

Publication number
CN109472820B
CN109472820B (application number CN201811222294.7A)
Authority
CN
China
Prior art keywords
dimensional coordinates, rigid motion, face, current, human face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811222294.7A
Other languages
Chinese (zh)
Other versions
CN109472820A (en)
Inventor
徐枫
冯铖锃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201811222294.7A
Publication of CN109472820A
Application granted
Publication of CN109472820B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular RGB-D camera real-time face reconstruction method and device. The method comprises the following steps: detecting the positions of the face feature points on an input RGB face image with an advanced face feature point detection algorithm; obtaining the three-dimensional coordinates of each feature point of the current frame from the positions of the face feature points; acquiring the current three-dimensional coordinates of each face feature point on the key frame; obtaining the global rigid motion from the key frame to the current frame from the three-dimensional coordinates and the current three-dimensional coordinates, yielding a rigid motion result; using the rigid motion result as the initialization of ICP to fine-tune the rigid motion of the face; and applying the rigid motion result to the key frame model to update the TSDF representation of the model. The method effectively removes the depth of non-face regions, removes the influence of non-rigid motion, and improves the accuracy of rigid motion estimation by using the face feature points.

Description

Monocular RGB-D camera real-time face reconstruction method and device
Technical Field
The invention relates to the technical field of three-dimensional reconstruction, in particular to a monocular RGB-D camera real-time face reconstruction method and device.
Background
In the related art, three-dimensional reconstruction is a research hotspot in computer vision and computer graphics; it is one of the core technologies of virtual reality/augmented reality, autonomous driving, robotics and other fields, and has wide application. In recent years, much work has used consumer-grade depth cameras (such as the Microsoft Kinect and Intel RealSense) to perform real-time three-dimensional reconstruction of general scenes and objects.
Most of this work uses the ICP algorithm to rigidly register the reconstructed geometry with the input point cloud of the current frame, estimating the rigid motion (global rotation and translation) of the current frame relative to the key frame. This approach is severely limited when the camera or the reconstructed object moves quickly, and reconstruction failures caused by inaccurate rigid motion estimation are common.
Disclosure of Invention
The present application is based on the recognition and discovery by the inventors of the following problems:
the real-time three-dimensional reconstruction of the monocular RGB-D camera is a research hotspot in the field of computer graphics and computer vision, and how to rapidly and accurately reconstruct information such as the geometry, reflectivity, ambient illumination and the like of a common object according to input data of the monocular RGB-D camera is an important research topic. Advanced reconstruction techniques in recent years mostly use Iterative Closest Point (ICP) based algorithms in the geometric registration stage, but such methods generally only cope with slow camera or object motion.
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a monocular RGB-D camera real-time face reconstruction method that effectively removes the depth of non-face regions, removes the influence of non-rigid motion, and improves the accuracy of rigid motion estimation by using face feature points.
Another object of the present invention is to provide a monocular RGB-D camera real-time face reconstruction device.
In order to achieve the above object, an embodiment of one aspect of the present invention provides a monocular RGB-D camera real-time face reconstruction method comprising the following steps: step S1: detecting the positions of the face feature points on an input RGB face image with an advanced face feature point detection algorithm; step S2: obtaining the three-dimensional coordinates of each feature point of the current frame from the positions of the face feature points; step S3: acquiring the current three-dimensional coordinates of each face feature point on the key frame; step S4: obtaining the global rigid motion from the key frame to the current frame from the three-dimensional coordinates and the current three-dimensional coordinates, yielding a rigid motion result; step S5: using the rigid motion result as the initialization of ICP (Iterative Closest Point) to fine-tune the rigid motion of the face; and step S6: applying the rigid motion result to the key frame model to update the TSDF (truncated signed distance function) representation of the model.
The monocular RGB-D camera real-time face reconstruction method of the embodiment of the invention takes the particularity of the face structure into account and uses advanced facial feature point detection to improve the accuracy of real-time face reconstruction with a monocular RGB-D camera. As a new way of estimating global rigid motion for special targets such as faces, it can handle real-time three-dimensional face reconstruction under fast face motion: it effectively removes the depth of non-face regions, removes the influence of non-rigid motion, and improves the accuracy of rigid motion estimation by using the face feature points.
In addition, the monocular RGB-D camera real-time face reconstruction method according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the step S1 further includes: dividing the feature points of the outer circle of the face into a left feature point and a right feature point; respectively fitting the left characteristic point and the right characteristic point by using an exponential function curve, and after fitting, reserving depth data of a region which is positioned above the two curves simultaneously; and setting the depth values outside the area to be zero.
Further, in an embodiment of the present invention, the step S2 further includes: finding the corresponding position of each of the remaining inner feature points on the depth image, and obtaining the three-dimensional coordinates of each feature point of the current frame by back projection through the intrinsic matrix of the depth camera.
Further, in an embodiment of the present invention, the step S3 further includes: rendering the currently reconstructed model into its corresponding depth map, and obtaining from it the current three-dimensional coordinates of the feature points on the key frame model.
Further, in an embodiment of the present invention, the step S4 further includes: modeling the global rigid motion as an optimization problem, the optimization target being

$$\min_{R,t} \sum_{i=1}^{n} \left\| R\,p_i^{key} + t - p_i^{live} \right\|_2^2$$

where $R$ and $t$ respectively denote the rigid rotation and translation to be optimized, $n$ is the number of feature points, $p_i^{live}$ denotes the three-dimensional coordinates of the $i$-th feature point of the current input frame, and $p_i^{key}$ denotes the three-dimensional coordinates of the $i$-th feature point of the key frame.
In order to achieve the above object, an embodiment of another aspect of the present invention provides a monocular RGB-D camera real-time face reconstruction device comprising: a detection module for detecting the positions of the face feature points on an input RGB face image with an advanced face feature point detection algorithm; a first processing module for obtaining the three-dimensional coordinates of each feature point of the current frame from the positions of the face feature points; an acquisition module for acquiring the current three-dimensional coordinates of each face feature point on the key frame; a second processing module for obtaining the global rigid motion from the key frame to the current frame from the three-dimensional coordinates and the current three-dimensional coordinates, yielding a rigid motion result; an initialization module for using the rigid motion result as the initialization of ICP (Iterative Closest Point) to fine-tune the rigid motion of the face; and an update module for applying the rigid motion result to the key frame model to update the TSDF representation of the model.
The monocular RGB-D camera real-time face reconstruction device of the embodiment of the invention takes the particularity of the face structure into account, uses advanced facial feature point detection to improve the accuracy of real-time face reconstruction with a monocular RGB-D camera, and provides a new way of estimating global rigid motion for special targets such as faces.
In addition, the monocular RGB-D camera real-time face reconstruction device according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, the detection module is further configured to divide the feature points outside the face into a left feature point and a right feature point, respectively curve-fit the left feature point and the right feature point with an exponential function, and after the curve-fit, keep the depth data of the area located above the two curves at the same time, and set the depth values outside the area to zero.
Further, in an embodiment of the present invention, the first processing module is further configured to find the corresponding position of each of the remaining inner feature points on the depth image, and to obtain the three-dimensional coordinates of each feature point of the current frame by back projection through the intrinsic matrix of the depth camera.
Further, in an embodiment of the present invention, the obtaining module is further configured to render the currently reconstructed model into its corresponding depth map and obtain from it the current three-dimensional coordinates of the feature points on the key frame model.
Further, in an embodiment of the present invention, the second processing module is further configured to model the global rigid motion as an optimization problem, the optimization target being

$$\min_{R,t} \sum_{i=1}^{n} \left\| R\,p_i^{key} + t - p_i^{live} \right\|_2^2$$

where $R$ and $t$ respectively denote the rigid rotation and translation to be optimized, $n$ is the number of feature points, $p_i^{live}$ denotes the three-dimensional coordinates of the $i$-th feature point of the current input frame, and $p_i^{key}$ denotes the three-dimensional coordinates of the $i$-th feature point of the key frame.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a monocular RGB-D camera real-time face reconstruction method according to one embodiment of the present invention;
FIG. 2 is a flow chart of a monocular RGB-D camera real-time face reconstruction method according to one embodiment of the present invention;
FIG. 3 is a graph comparing the estimation of rigid motion using feature points and the estimation of ICP according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a monocular RGB-D camera real-time face reconstruction device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to elements that are the same or similar or have the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
The monocular RGB-D camera real-time face reconstruction method and apparatus proposed in the embodiment of the present invention will be described below with reference to the accompanying drawings, and first, the monocular RGB-D camera real-time face reconstruction method proposed in the embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 1 is a flowchart of a monocular RGB-D camera real-time face reconstruction method according to an embodiment of the present invention.
As shown in fig. 1, the monocular RGB-D camera real-time face reconstruction method includes the following steps:
step S1: and detecting the positions of the human face characteristic points on the input human face RGB image through an advanced human face characteristic point detection algorithm.
Further, in an embodiment of the present invention, the step S1 further includes: dividing the outer-contour feature points of the face into left feature points and right feature points; fitting the left feature points and the right feature points with an exponential function curve each; after fitting, retaining only the depth data of the region lying above both curves; the depth values outside that region are set to zero.
It can be understood that, as shown in fig. 2, the positions of the face feature points are detected on the input RGB face image with an advanced face feature point detection algorithm; this step uses only the outer-contour feature points. The embodiment of the invention divides the outer-contour feature points into a left half and a right half and fits each half with an exponential function curve. After fitting, only the depth data of the region lying above both curves is retained; the parts outside this region are considered not to belong to the face, and the depth values there are set to zero.
It should be noted that, in the embodiment of the present invention, an RGB image with a resolution of 640 × 480 and a depth image with the same resolution are used, and the RGB image and the depth image are aligned in advance, so that pixels at the same position on two images have a corresponding relationship, which is only an example and is not limited specifically herein.
Specifically, the removal of the depth data of non-face regions in the embodiment of the present invention proceeds as follows:
the input depth image usually contains depth data of a non-face area, such as shoulders, a background and the like, because the movement of the face is inconsistent with the movement of the non-face area in the rotation process, non-rigid movement is generated on the whole, and the depth data of the non-face area is automatically removed by utilizing a curve enclosed by peripheral feature points of the face.
The embodiment of the invention divides the outer-contour feature points of the face into a left half and a right half and fits each half with an exponential function curve. After fitting, only the region lying above both curves is retained; the parts outside this region do not belong to the face, so the depth values there are set to zero.
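As a concrete illustration, the following is a minimal sketch of this masking step. It assumes an exponential model of the form y = a·exp(b·x) + c for each half of the outer contour (the patent does not state the exact parametrization) and 2D landmarks in pixel coordinates; `exp_curve` and `mask_non_face_depth` are hypothetical names.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_curve(x, a, b, c):
    # Assumed exponential form; the patent only says each half of the
    # outer contour is fitted with an exponential function curve.
    return a * np.exp(b * x) + c

def mask_non_face_depth(depth, left_pts, right_pts):
    """Zero out depth everywhere below either fitted contour curve.

    depth:     (H, W) depth image, already aligned to the RGB image
    left_pts:  (N, 2) pixel coords (x, y) of the left outer landmarks
    right_pts: (M, 2) pixel coords (x, y) of the right outer landmarks
    """
    h, w = depth.shape
    xs = np.arange(w)
    ys = np.arange(h)[:, None]                 # (H, 1) row indices
    masked = depth.copy()
    for pts in (left_pts, right_pts):
        params, _ = curve_fit(exp_curve, pts[:, 0], pts[:, 1], maxfev=5000)
        curve_y = exp_curve(xs, *params)       # curve height per column
        # Image y grows downward, so "above the curve" means y < curve_y;
        # pixels below either curve are treated as non-face and zeroed.
        masked[ys > curve_y[None, :]] = 0
    return masked
```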
Step S2: and obtaining the three-dimensional coordinates of each characteristic point of the current frame according to the positions of the human face characteristic points.
Further, in an embodiment of the present invention, the step S2 further includes: finding the corresponding position of each of the remaining inner feature points on the depth image, and obtaining the three-dimensional coordinates of each feature point of the current frame by back projection through the intrinsic matrix of the depth camera.
It can be understood that, as shown in fig. 2, the embodiment of the present invention starts from the pixel coordinates of the face feature points on the RGB image obtained in step S1; in contrast to step S1, it uses the remaining inner feature points instead of the outer-contour ones. The corresponding position of each feature point is found on the depth image; because the RGB image and the depth image are aligned, the pixel coordinates are the same on both. Finally, back projection through the intrinsic matrix of the depth camera yields the three-dimensional coordinates $\{p_i^{live} \mid p_i^{live} \in \mathbb{R}^3,\ i = 1, \ldots, n\}$ of the feature points of the current frame.
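For illustration, back projection under the standard pinhole model can be sketched as follows; `back_project` is a hypothetical helper, and it assumes a metric depth image already aligned to the RGB image, as the text describes.

```python
import numpy as np

def back_project(pixels, depth, K):
    """Back-project 2D feature points to 3D depth-camera coordinates.

    pixels: (n, 2) array of (u, v) pixel coordinates of the feature points
    depth:  (H, W) aligned depth image (meters)
    K:      (3, 3) depth-camera intrinsic matrix
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth[v.astype(int), u.astype(int)]    # depth lookup at each landmark
    x = (u - cx) * z / fx                      # invert the pinhole projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)         # one row per p_i^live
```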
Step S3: and acquiring the current three-dimensional coordinates of each human face characteristic point on the key frame.
Further, in an embodiment of the present invention, the step S3 further includes: rendering the currently reconstructed model into its corresponding depth map, and obtaining from it the current three-dimensional coordinates of the feature points on the key frame model.
It can be understood that, as shown in fig. 2, when calculating the current three-dimensional coordinates of each face feature point on the key frame, the embodiment of the present invention needs to render the currently reconstructed model into its corresponding depth map, and then calculate the three-dimensional coordinates of the feature points on the key frame model, $\{p_i^{key} \mid p_i^{key} \in \mathbb{R}^3,\ i = 1, \ldots, n\}$, using a method similar to that in step S2.
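The render step can be sketched as a crude point-splat z-buffer; a real-time system would rasterize the mesh on the GPU, so this is only a stand-in, with `render_depth` a hypothetical name and the model vertices assumed to already be expressed in the render camera's coordinate frame.

```python
import numpy as np

def render_depth(vertices, K, width=640, height=480):
    """Splat model vertices into a z-buffered depth map (nearest depth wins)."""
    depth = np.full((height, width), np.inf)
    z = vertices[:, 2]
    front = z > 1e-6                           # ignore points behind the camera
    u = np.round(vertices[front, 0] * K[0, 0] / z[front] + K[0, 2]).astype(int)
    v = np.round(vertices[front, 1] * K[1, 1] / z[front] + K[1, 2]).astype(int)
    ok = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi in zip(u[ok], v[ok], z[front][ok]):
        if zi < depth[vi, ui]:                 # keep the closest surface point
            depth[vi, ui] = zi
    depth[np.isinf(depth)] = 0.0               # mark empty pixels as invalid
    return depth
```

The keyframe coordinates $\{p_i^{key}\}$ of the feature points can then be recovered by back-projecting this rendered depth map at the landmark pixels, exactly as in the previous sketch.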
Step S4: and obtaining global rigid motion from the key frame to each frame according to the three-dimensional coordinates and the current three-dimensional coordinates to obtain a rigid motion result.
It can be understood that, as shown in fig. 2, the global rigid motion $R$ and $t$ from the key frame to the current frame is calculated from the three-dimensional coordinates of these two groups of feature points; the embodiment of the invention models this as an optimization problem.
Wherein, in an embodiment of the present invention, the step S4 further includes: modeling the global rigid motion as an optimization problem, the optimization target being

$$\min_{R,t} \sum_{i=1}^{n} \left\| R\,p_i^{key} + t - p_i^{live} \right\|_2^2$$

where $R$ and $t$ respectively denote the rigid rotation and translation to be optimized, $n$ is the number of feature points, $p_i^{live}$ denotes the three-dimensional coordinates of the $i$-th feature point of the current input frame, and $p_i^{key}$ denotes the three-dimensional coordinates of the $i$-th feature point of the key frame.
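The objective above is the classical least-squares rigid alignment problem, which has a closed-form solution via the SVD (the Kabsch construction). The patent does not name its solver, so the following is a sketch under that assumption; `fit_rigid_motion` is a hypothetical name.

```python
import numpy as np

def fit_rigid_motion(p_key, p_live):
    """Least-squares rigid motion (R, t) such that R @ p_key + t ≈ p_live.

    p_key, p_live: (n, 3) arrays of corresponding feature-point coordinates.
    """
    mu_k = p_key.mean(axis=0)
    mu_l = p_live.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (p_key - mu_k).T @ (p_live - mu_l)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) in the optimal orthogonal matrix
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_l - R @ mu_k
    return R, t
```

Because the landmark correspondences are known, no iteration is needed here, which is what makes this estimate a cheap and robust initialization even under fast motion.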
Step S5: the rigid motion result is used as an initialization of the ICP to fine tune the rigid motion of the face.
It will be appreciated that embodiments of the present invention, as shown in fig. 2, use this estimate as an initialization of ICP to further fine tune the rigid motion of the face.
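How the landmark-based estimate seeds the refinement can be sketched with a plain point-to-point ICP loop; real pipelines typically use a projective point-to-plane ICP on the GPU, so this is illustrative only, and it reuses the hypothetical `fit_rigid_motion` helper from the previous sketch.

```python
import numpy as np
from scipy.spatial import cKDTree

def refine_with_icp(model_pts, live_pts, R, t, iters=10):
    """Point-to-point ICP refinement starting from the landmark-based (R, t).

    model_pts: (N, 3) points sampled from the keyframe model
    live_pts:  (M, 3) points from the current input point cloud
    """
    tree = cKDTree(live_pts)                   # nearest-neighbor search structure
    for _ in range(iters):
        moved = model_pts @ R.T + t            # apply the current estimate
        _, idx = tree.query(moved)             # closest live point per model point
        # Re-solve the closed-form alignment on the updated correspondences
        R, t = fit_rigid_motion(model_pts, live_pts[idx])
    return R, t
```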
Step S6: rigid motion results are applied to the keyframe model to update the TSDF representation of the model.
It can be understood that, as shown in fig. 2, the embodiment of the present invention applies the currently estimated rigid motion result to the key frame model to update the TSDF representation of the model. A comparison of rigid motion estimated with the feature points against the ICP-only estimate is shown in fig. 3.
Specifically, following steps S2-S6, the embodiment of the present invention estimates the global rigid motion accurately using the feature points, as follows:
In each frame, the three-dimensional coordinates of two groups of feature points are calculated: one group is the three-dimensional coordinates of the feature points of the current input frame, and the other is the three-dimensional coordinates of the feature points after the key frame is updated.
The three-dimensional coordinates of the face feature points of the current frame's input point cloud can be calculated from the pixel coordinates of the two-dimensional feature points detected on the RGB image of that frame and the intrinsic matrix of the depth camera: after the feature points are detected on the RGB image, the pixel coordinates of each feature point on the depth map are looked up, and the three-dimensional coordinates of each feature point in the depth camera coordinate system, $\{p_i^{live} \mid p_i^{live} \in \mathbb{R}^3,\ i = 1, \ldots, n\}$, are obtained through the depth camera intrinsic matrix, where $n$ is the number of face feature points used. The outer-contour feature points are not used here, because their semantic positions on the face may change under different poses.
For the three-dimensional coordinates of the face feature points on the key frame: since the reconstructed face model is updated in every frame, its surface becomes more and more complete and its noise keeps decreasing. In each frame, the currently reconstructed model therefore needs to be rendered into its corresponding depth map, and the three-dimensional coordinates of the feature points on the key frame model, $\{p_i^{key} \mid p_i^{key} \in \mathbb{R}^3,\ i = 1, \ldots, n\}$, are then calculated with a method similar to the one used for the feature points of the input point cloud.
According to the three-dimensional coordinates of the two groups of feature points, the global rigid motion $R$ and $t$ from the key frame to the current frame is calculated by modeling it as an optimization problem with the objective:

$$\min_{R,t} \sum_{i=1}^{n} \left\| R\,p_i^{key} + t - p_i^{live} \right\|_2^2$$
in the embodiment of the invention, the estimation result is used as the initialization of ICP (inductively coupled plasma) to further fine-tune rigid motion of the face, and because some feature points are shielded under a large posture, for example, when the angle of the side face exceeds 45 degrees, the calculation of three-dimensional coordinates of part of the feature points is inaccurate.
Finally, the currently estimated rigid motion result is applied to the key frame model, and the TSDF representation of the model is updated.
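A minimal sketch of that fusion step follows, assuming standard weighted-average TSDF integration in the style of Curless and Levoy over a flat list of voxel centers; the patent does not spell out its integration rule, and `update_tsdf` and its voxel layout are hypothetical.

```python
import numpy as np

def update_tsdf(tsdf, weight, voxels, depth, K, R, t, trunc=0.01):
    """Fuse one depth frame into the keyframe TSDF under rigid motion (R, t).

    voxels:       (N, 3) voxel centers in keyframe/model coordinates
    tsdf, weight: (N,) running TSDF values and integration weights
    """
    cam = voxels @ R.T + t                     # voxels into the current camera frame
    z = cam[:, 2]
    front = z > 1e-6                           # only voxels in front of the camera
    u = np.full(z.shape, -1, dtype=int)
    v = np.full(z.shape, -1, dtype=int)
    u[front] = np.round(cam[front, 0] * K[0, 0] / z[front] + K[0, 2])
    v[front] = np.round(cam[front, 1] * K[1, 1] / z[front] + K[1, 2])
    h, w = depth.shape
    ok = front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = depth[v[ok], u[ok]]                    # observed depth along each voxel's ray
    sdf = d - z[ok]                            # signed distance to the surface
    valid = (d > 0) & (sdf > -trunc)           # skip voxels far behind the surface
    new = np.clip(sdf[valid] / trunc, -1.0, 1.0)
    i = np.where(ok)[0][valid]
    tsdf[i] = (tsdf[i] * weight[i] + new) / (weight[i] + 1.0)  # running average
    weight[i] += 1.0
    return tsdf, weight
```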
According to the monocular RGB-D camera real-time face reconstruction method of the embodiment of the invention, the particularity of the face structure is taken into account and advanced facial feature point detection is used to improve the accuracy of real-time face reconstruction with a monocular RGB-D camera. As a new way of estimating global rigid motion for special targets such as faces, the method can handle real-time three-dimensional face reconstruction under fast face motion: it effectively removes the depth of non-face regions, removes the influence of non-rigid motion, and improves the accuracy of rigid motion estimation by using the face feature points.
Next, a monocular RGB-D camera real-time face reconstruction device according to an embodiment of the present invention will be described with reference to the drawings.
Fig. 4 is a schematic structural diagram of a monocular RGB-D camera real-time face reconstruction device according to an embodiment of the present invention.
As shown in fig. 4, the monocular RGB-D camera real-time face reconstruction device 10 includes: a detection module 100, a first processing module 200, an acquisition module 300, a second processing module 400, an initialization module 500, and an update module 600.
The detection module 100 is configured to detect the positions of the face feature points on the input RGB face image through an advanced face feature point detection algorithm. The first processing module 200 is configured to obtain three-dimensional coordinates of each feature point of the current frame according to the position of the feature point of the face. The obtaining module 300 is configured to obtain current three-dimensional coordinates of each facial feature point on the key frame. The second processing module 400 is configured to obtain a global rigid motion from the key frame to each frame according to the three-dimensional coordinates and the current three-dimensional coordinates, so as to obtain a rigid motion result. The initialization module 500 is used to use the rigid motion result as the initialization of ICP to fine tune the rigid motion of the face. The update module 600 is configured to apply the rigid motion results to the keyframe model to update the TSDF representation of the model. The device 10 of the embodiment of the invention effectively removes the depth of the non-face area, removes the influence of non-rigid motion, and can improve the accuracy of rigid motion estimation by using the human face characteristic points.
Further, in an embodiment of the present invention, the detection module 100 is further configured to divide the outer-contour feature points of the face into left and right feature points, fit each group with an exponential function curve, retain, after fitting, only the depth data of the region lying above both curves, and set the depth values outside that region to zero.
Further, in an embodiment of the present invention, the first processing module 200 is further configured to find the corresponding position of each of the remaining inner feature points on the depth image, and to obtain the three-dimensional coordinates of each feature point of the current frame by back projection through the intrinsic matrix of the depth camera.
Further, in an embodiment of the present invention, the obtaining module 300 is further configured to render the currently reconstructed model into its corresponding depth map and obtain from it the current three-dimensional coordinates of the feature points on the key frame model.
Further, in an embodiment of the present invention, the second processing module 400 is further configured to model the global rigid motion as an optimization problem, the optimization target being

$$\min_{R,t} \sum_{i=1}^{n} \left\| R\,p_i^{key} + t - p_i^{live} \right\|_2^2$$

where $R$ and $t$ respectively denote the rigid rotation and translation to be optimized, $n$ is the number of feature points, $p_i^{live}$ denotes the three-dimensional coordinates of the $i$-th feature point of the current input frame, and $p_i^{key}$ denotes the three-dimensional coordinates of the $i$-th feature point of the key frame.
It should be noted that the explanation of the embodiment of the monocular RGB-D camera real-time face reconstruction method is also applicable to the monocular RGB-D camera real-time face reconstruction device of the embodiment, and details are not repeated here.
According to the monocular RGB-D camera real-time face reconstruction device of the embodiment of the invention, the particularity of the face structure is taken into account and advanced facial feature point detection is used to improve the accuracy of real-time face reconstruction with a monocular RGB-D camera. As a new way of estimating global rigid motion for special targets such as faces, the device can handle real-time three-dimensional face reconstruction under fast face motion: it effectively removes the depth of non-face regions, removes the influence of non-rigid motion, and improves the accuracy of rigid motion estimation by using the face feature points.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be considered limiting of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Also, a first feature "on," "over," or "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature "under," "below," or "beneath" a second feature may be directly or obliquely below the second feature, or may simply indicate that the first feature is at a lower level than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A monocular RGB-D camera real-time face reconstruction method is characterized by comprising the following steps:
step S1: detecting the positions of the human face characteristic points on the input human face RGB image through an advanced human face characteristic point detection algorithm; the step S1 further includes: dividing the feature points of the outer circle of the face into a left feature point and a right feature point; respectively fitting the left characteristic point and the right characteristic point by using an exponential function curve, and after fitting, reserving depth data of a region which is positioned above the two curves simultaneously; setting the depth value outside the area to be zero;
step S2: obtaining the three-dimensional coordinates of each characteristic point of the current frame according to the positions of the human face characteristic points;
step S3: acquiring the current three-dimensional coordinates of each human face characteristic point on the key frame;
step S4: obtaining global rigid motion from the key frame to the current frame according to the three-dimensional coordinates and the current three-dimensional coordinates to obtain a rigid motion result;
step S5: using the rigid motion result as the initialization of an iterative closest point algorithm to fine-tune the rigid motion of the face; and
step S6: applying the rigid motion results to the keyframe model to update the TSDF representation of the model.
2. The monocular RGB-D camera real-time face reconstruction method according to claim 1, wherein the step S2 further includes:
and searching the corresponding position of each feature point on the depth image according to the residual internal feature points, and obtaining the three-dimensional coordinates of each feature point of the current frame through back projection of the internal reference matrix of the depth camera.
3. The monocular RGB-D camera real-time face reconstruction method according to claim 1, wherein the step S3 further includes:
rendering the corresponding depth map of the current reconstructed model, and acquiring the current three-dimensional coordinates of the feature points on the key frame model.
4. The monocular RGB-D camera real-time face reconstruction method according to claim 1, wherein the step S4 further includes:
modeling global rigid motion as an optimization problem, the optimization target being:

$$\min_{R,t} \sum_{i=1}^{n} \left\| R\,p_i^{key} + t - p_i^{live} \right\|_2^2$$

wherein $R$ and $t$ respectively represent the rigid rotation and translation to be optimized, $n$ is the number of feature points, $p_i^{live}$ represents the three-dimensional coordinates of the $i$-th feature point of the current input frame, and $p_i^{key}$ represents the three-dimensional coordinates of the $i$-th feature point of the key frame.
5. A monocular RGB-D camera real-time face reconstruction device, characterized by comprising:
the detection module is used for detecting the positions of the human face characteristic points on the input human face RGB image through an advanced human face characteristic point detection algorithm; the detection module is further used for dividing the feature points of the outer circle of the face into a left feature point and a right feature point, fitting the left feature point and the right feature point respectively by using an exponential function curve, reserving depth data of an area which is simultaneously positioned above the two curves after fitting, and setting depth values outside the area to be zero;
the first processing module is used for obtaining the three-dimensional coordinates of each characteristic point of the current frame according to the position of the face characteristic point;
the acquisition module is used for acquiring the current three-dimensional coordinates of each facial feature point on the key frame;
the second processing module is used for obtaining global rigid motion from the key frame to the current frame according to the three-dimensional coordinates and the current three-dimensional coordinates so as to obtain a rigid motion result;
the initialization module is used for using the rigid motion result as the initialization of an iterative closest point algorithm to fine-tune the rigid motion of the face; and
and the updating module is used for applying the rigid motion result to the key frame model so as to update the TSDF representation of the model.
6. The monocular RGB-D camera real-time face reconstruction device of claim 5, wherein the first processing module is further configured to find the corresponding position of each feature point on the depth image according to the remaining inner feature points, and to obtain the three-dimensional coordinates of each feature point of the current frame by back projection through the intrinsic matrix of the depth camera.
7. The monocular RGB-D camera real-time face reconstruction device of claim 5, wherein the obtaining module is further configured to render the currently reconstructed model into its corresponding depth map and obtain current three-dimensional coordinates of feature points on the keyframe model.
8. The monocular RGB-D camera real-time face reconstruction device of claim 5, wherein the second processing module is further configured to model global rigid motion as an optimization problem, the optimization target being:

$$\min_{R,t} \sum_{i=1}^{n} \left\| R\,p_i^{key} + t - p_i^{live} \right\|_2^2$$

wherein $R$ and $t$ respectively represent the rigid rotation and translation to be optimized, $n$ is the number of feature points, $p_i^{live}$ represents the three-dimensional coordinates of the $i$-th feature point of the current input frame, and $p_i^{key}$ represents the three-dimensional coordinates of the $i$-th feature point of the key frame.
CN201811222294.7A 2018-10-19 2018-10-19 Monocular RGB-D camera real-time face reconstruction method and device Active CN109472820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811222294.7A CN109472820B (en) 2018-10-19 2018-10-19 Monocular RGB-D camera real-time face reconstruction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811222294.7A CN109472820B (en) 2018-10-19 2018-10-19 Monocular RGB-D camera real-time face reconstruction method and device

Publications (2)

Publication Number Publication Date
CN109472820A CN109472820A (en) 2019-03-15
CN109472820B true CN109472820B (en) 2021-03-16

Family

ID=65665744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811222294.7A Active CN109472820B (en) 2018-10-19 2018-10-19 Monocular RGB-D camera real-time face reconstruction method and device

Country Status (1)

Country Link
CN (1) CN109472820B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109949412B (en) * 2019-03-26 2021-03-02 腾讯科技(深圳)有限公司 Three-dimensional object reconstruction method and device
CN110363858B (en) * 2019-06-18 2022-07-01 新拓三维技术(深圳)有限公司 Three-dimensional face reconstruction method and system
CN110533773A (en) * 2019-09-02 2019-12-03 北京华捷艾米科技有限公司 A kind of three-dimensional facial reconstruction method, device and relevant device
CN110689625B (en) * 2019-09-06 2021-07-16 清华大学 Automatic generation method and device for customized face mixed expression model
CN110910452B (en) * 2019-11-26 2023-08-25 上海交通大学 Low-texture industrial part pose estimation method based on deep learning
CN113221600B (en) * 2020-01-21 2022-06-21 魔门塔(苏州)科技有限公司 Method and device for calibrating image feature points
CN113674161A (en) * 2021-07-01 2021-11-19 清华大学 Face deformity scanning completion method and device based on deep learning
CN113902847B (en) * 2021-10-11 2024-04-16 岱悟智能科技(上海)有限公司 Monocular depth image pose optimization method based on three-dimensional feature constraint

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8682066B2 (en) * 2010-03-11 2014-03-25 Ramot At Tel-Aviv University Ltd. Devices and methods of reading monochromatic patterns
KR101556992B1 (en) * 2014-03-13 2015-10-05 손우람 3d scanning system using facial plastic surgery simulation
CN106289181B (en) * 2015-05-22 2018-12-18 北京雷动云合智能技术有限公司 A kind of real-time SLAM method of view-based access control model measurement
CN106934827A (en) * 2015-12-31 2017-07-07 杭州华为数字技术有限公司 The method for reconstructing and device of three-dimensional scenic
CN106446815B (en) * 2016-09-14 2019-08-09 浙江大学 A kind of simultaneous localization and mapping method
CN106910242B (en) * 2017-01-23 2020-02-28 中国科学院自动化研究所 Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera
CN108549873B (en) * 2018-04-19 2019-12-24 北京华捷艾米科技有限公司 Three-dimensional face recognition method and three-dimensional face recognition system

Also Published As

Publication number Publication date
CN109472820A (en) 2019-03-15

Similar Documents

Publication Publication Date Title
CN109472820B (en) Monocular RGB-D camera real-time face reconstruction method and device
CN110998659B (en) Image processing system, image processing method, and program
US9420265B2 (en) Tracking poses of 3D camera using points and planes
CN108564616B (en) Fast robust RGB-D indoor three-dimensional scene reconstruction method
CN108629843B (en) Method and equipment for realizing augmented reality
EP2656309B1 (en) Method for determining a parameter set designed for determining the pose of a camera and for determining a three-dimensional structure of the at least one real object
CN111144213B (en) Object detection method and related equipment
JP2007310707A (en) Apparatus and method for estimating posture
CN108225319B (en) Monocular vision rapid relative pose estimation system and method based on target characteristics
CN108519102B (en) Binocular vision mileage calculation method based on secondary projection
CN110111388A (en) Three-dimension object pose parameter estimation method and visual apparatus
CN112083403B (en) Positioning tracking error correction method and system for virtual scene
CN113393503B (en) Classification-driven shape prior deformation category-level object 6D pose estimation method
CN113744315B (en) Semi-direct vision odometer based on binocular vision
CN114494150A (en) Design method of monocular vision odometer based on semi-direct method
CN111829522B (en) Instant positioning and map construction method, computer equipment and device
CN109872343B (en) Weak texture object posture tracking method, system and device
CN116468786A (en) Semantic SLAM method based on point-line combination and oriented to dynamic environment
CN105339981B (en) Method for using one group of primitive registration data
CN108694348B (en) Tracking registration method and device based on natural features
KR20220161340A (en) Image processing system and method
CN113723432B (en) Intelligent identification and positioning tracking method and system based on deep learning
CN117726747A (en) Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene
Fan et al. Collaborative three-dimensional completion of color and depth in a specified area with superpixels
US11417063B2 (en) Determining a three-dimensional representation of a scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant