CN114581571A

CN114581571A - Monocular human body reconstruction method and device based on IMU and forward deformation field

Info

Publication number: CN114581571A
Application number: CN202210207960.XA
Authority: CN
Inventors: 要宇馨; 江博艺
Original assignee: Hangzhou Xiangyan Technology Co ltd
Current assignee: Hangzhou Xiangyan Technology Co ltd
Priority date: 2022-03-04
Filing date: 2022-03-04
Publication date: 2022-06-03
Anticipated expiration: 2042-03-04
Also published as: CN114581571B

Abstract

The invention discloses a monocular human body reconstruction method and device based on an IMU (inertial measurement Unit) and a forward deformation field, which can obtain an accurate and high-quality natural and real rendering result of the geometry and new visual angle of a reconstructed dynamic human body by using a section of human body motion monocular RGB (red, green and blue) video training bound with an inertial sensor, and can drive the reconstructed human body to obtain new actions by using a newly input human body posture. According to the method, firstly, an implicit symbol distance field model for expressing a reference space shape and a nerve radiation field model for expressing colors are established, then a forward skin deformation field is established to obtain the deformation from a current space point to a reference space point corresponding to a frame-by-frame picture, the color value and the transparency of each pixel point in the frame-by-frame picture are obtained, and the difference between the color value and the transparency of each pixel point in the frame-by-frame picture is taken as a main loss function. In order to avoid the self-shielding problem of monocular video, the invention provides relative position information between adjacent frames as another main loss function for training by binding the inertial sensor on the moving human body.

Description

Monocular human body reconstruction method and device based on IMU and forward deformation field

Technical Field

The invention relates to the technical field of human body image processing, in particular to a high-quality monocular RGB (red, green and blue) video human body reconstruction method and device which adopt an inertial sensor, carry out human body deformation by utilizing a forward deformation field based on a human body skeleton parameterized model and carry out body rendering by combining geometric prediction and color prediction.

Background

In recent years, human image capture has become more popular in the production of motion picture games and the like, VR/AR, virtual digital human and the like applications. While the RGB image data shot by the monocular camera is the most common and easily available dynamic human body data form, but due to the limited information, it becomes a difficult problem to reconstruct a high-precision dynamic human body sequence.

In the past, the reconstruction of a dynamic human body is often obtained by means of depth data acquired by a structured light camera, the technologies generally obtain a reconstructed dynamic human body sequence by means of real-time non-rigid tracking and fusion of the depth data, and in order to better utilize human body prior information, a template model is established in advance for a human body model to assist in tracking. Or a dense multi-view video sequence is adopted to acquire the three-dimensional shape information. Recently, models based on sparse view and single view cameras have been proposed, which typically operate in conjunction with an implicit representation of the human body, where an implicit symbolic distance field or field expressing occupancy can express more complex detailed information with less storage than a displayed representation of a grid or point cloud. The technologies generally select a reference space, combine with a neural network to represent a deformation field from frame-by-frame dynamic data to the reference space, render a model by adopting a neural rendering technology, and optimize the reconstruction of a scene by constraining a rendered picture to be as close as possible to an input picture. In order to utilize a priori information of the human body, some methods use a commonly used and very popular parameterized model of the human skeleton (SMPL model) for basic low-dimensional representation of the human body, and use the representation to assist a deformation field to enable the model to handle more complex actions and increase robustness.

However, in the multi-view camera setting, even if the camera is a sparse multi-view camera, calibration and other problems between the cameras are required, so that the use is not convenient enough; the single camera is more convenient to use, but the acquired information is limited and ambiguous. Considering that an inertial sensor (IMU) can provide three-dimensional information such as velocity acceleration and direction of adjacent frames, there is wide reference in attitude estimation of a human body, and it is easy to add the sensor to an AR/VR device or the like. Therefore, we conducted a study of the human reconstruction problem of monocular RGB video in combination with inertial sensors. Therefore, a preprocessing process of multiple cameras is not needed, and the relative three-dimensional position information of adjacent frames provided by the inertial sensor enables the reconstruction system to better process human body sequences with any motion of the occlusion, and the reconstruction system is not limited to motion sequences with small motion amplitude and the like.

In order to model a deformation field of a dynamic human body, a skeleton model parameterized model (SMPL model) of the human body is used as a low-dimensional deformation representation, and because the linear mixed skin weight expression capacity in the parameterized model is limited, only a naked human body can be expressed, a neural network is used for learning the skin weight from any point on the surface of the dressed human body to a joint point. And, we use the deformation from the reference space to be learned to the real-time space of the current frame, forward deformation to express the deformation model. Compared with backward deformation from the current image space to the reference space, forward deformation is easier to learn, deformation models of a plurality of actions are more uniform, and new skeleton deformation can be given to obtain new actions of the dressing human body.

To take advantage of the input image as a supervision, we use a volume rendering technique that combines geometric prediction and color prediction for rendering. A learnable color prediction network and a symbol distance field for expressing geometric information are established for a reference space, the model corresponding to each input picture is deformed to the reference space to be rendered by combining the deformation field to obtain a rendered image, and then the rendered image and the input picture are subjected to pixel-by-pixel similarity measurement to perform joint learning. In this way, the implicit representation of the input picture frame by frame can be learned, and the three-dimensional curved surface can be extracted by predicting the symbolic distance value of the curved surface at random sampling points in the space. The forward deformation field and reference space we have learned can also be used to generate new poses of the reconstructed body, and new motion sequences can be used to drive the virtual body motion.

Disclosure of Invention

The invention aims to provide a monocular human body reconstruction method and device based on an IMU and a forward deformation field, aiming at the defects of the prior art, and the monocular human body reconstruction method and device can reconstruct a moving human body quickly, accurately and in high quality.

The purpose of the invention is realized by the following technical scheme:

according to a first aspect of the present specification, there is provided a monocular human reconstruction method based on an IMU and a forward deformation field, the method comprising:

s1: collecting a human motion monocular RGB video wearing an inertial sensor, segmenting a human body and a background of the human motion monocular RGB video frame by frame, recording a human body position bound by the inertial sensor and a derived acceleration signal, and recording an inertial sensor frame rate and a monocular RGB video shooting frame rate;

s2: fitting the human motion monocular RGB video frame by using a pre-trained human parametric fitting model to obtain the initial estimation of the shape and the posture of the human parametric fitting model frame by frame, marking a sensor label on a standard grid point corresponding to the human parametric fitting model, and indicating whether an inertial sensor and the bound sensor label are bound at the point;

s3: establishing a forward skin deformation field model from a point in a reference space to a point in a current space corresponding to the frame-by-frame human motion picture by adopting the learnable skin weight and combining an initial estimation result obtained by S2;

s4: establishing an implicit symbolic distance field model in a reference space that expresses a reference shape using a neural network;

s5: establishing a nerve radiation field model expressing colors in a reference space by utilizing a neural network;

s6: sampling rays from a current space corresponding to the frame-by-frame human motion picture in a volume rendering mode, and then sampling points along the rays;

s7: deforming the sampling point into a reference space according to the forward skin deformation field, obtaining a sampling point symbol distance value according to the implicit symbol distance field, and obtaining a sampling point color value and transparency according to the nerve radiation field;

s8: obtaining rendering color values corresponding to the frame-by-frame human motion picture according to the color values and the transparency of all sampling points along the ray direction;

s9: finding the closest point in the standard grid points corresponding to the human body parametric fitting model for each sampling point in S6, and transferring the sensor label of the standard grid point corresponding to the human body parametric fitting model to each sampling point; recording sampling points of the bound inertial sensor as key sampling points, firstly deforming the key sampling points into a reference space, then deforming the key sampling points into a current space corresponding to an adjacent frame through S3 to obtain new coordinates, and recording Euclidean distances between original coordinates and the new coordinates of the key sampling points;

s10: training a dynamic human body reconstruction model formed by the skin deformation field, the implicit symbol distance field and the nerve radiation field; obtaining a reconstructed human body according to the trained dynamic human body reconstruction model;

s11: inputting the new human body parameterized fitting model posture into the dynamic human body reconstruction model trained in S10, and generating the new posture of the reconstructed human body.

Further, in S10, the difference between the rendering color value obtained in S8 and the color value of the corresponding point of the human body picture split in S1 is taken as a loss function 1; obtaining a human body contour map according to the rendering color values obtained in the step S8, and taking the difference between the human body contour map obtained according to the human body picture divided in the step S1 and the human body contour map as a loss function 2; obtaining the acceleration of the key sampling point according to the Euclidean distance, the frame rate of the inertial sensor and the monocular RGB video shooting frame rate in S9, and taking the difference between the acceleration and the acceleration derived by the inertial sensor as a loss function 3; the weighted sum of the loss function 1, the loss function 2, and the loss function 3 is taken as the trained loss function.

Further, the forward skin deformation field model D in S3_wThe function of (d) is:

wherein x is_c(r_i,t_j) Representing a point in the reference space, x_d(r_i,t_j) Representing a sampling ray r in the current space corresponding to the frame-by-frame human motion picture_iStep-up, step-length is t_jOf the sampling points of (a) are,

transformation matrix, n, representing bones in a parameterized fitted model of the human body_bThe number of bones; the specific skin deformation formula is as follows:

wherein, w_kIs a learnable skinning weight; from the point x in the current space corresponding to the frame-by-frame human motion picture_d(r_i,t_j) Deformed to point x in reference space_c(r_i,t_j) The root of the skin deformation formula can be solved by using a numerical optimization method such as a Newton method or a quasi-Newton method.

Further, an implicit symbolic distance field f in reference space that expresses a reference shape in S4_sThe function of (d) is:

f_s:x_c(r_i,t_j)→(s_ij,F_ij)

wherein s is_ijA symbol distance value, F, representing the reference shape in the obtained reference space_ijIs a feature associated with the implicit symbolic distance field that establishes a connection in a reference space between the implicit symbolic distance field that expresses the reference shape and the neural radiation field that expresses the color.

Further, the implicit symbolic distance field model that expresses the reference shape in the reference space is a neural network model, which in turn comprises: input layer, nonlinear layer, full connection layer and loss layer.

Further, in S4, a frame is selected from the initial estimation of the shape and the pose of the frame-by-frame human parametric fit model, and the implicit symbolic distance field expressing the reference shape in the reference space is initialized by the standard grid corresponding to the frame.

Further, the nerve radiation field f of the expression color in the reference space in S5_cThe function of (d) is:

f_c:(x_c(r_i,t_j),r_i,F_ij)→c_ij

wherein, c_ijIs x_c(r_i,t_j) Obtaining a sampling ray r according to a discretized volume rendering formula_iCorresponding color value C (r)_i) Comprises the following steps:

wherein n is the number of sampling points on the sampling ray, alpha_ijTransparency for the sample point correspondences:

wherein phi_m(x)＝(1+e^-mx)^-1Is a Sigmoid function, m is a predefined parameter, s_i(j+1)Is to sample the ray r_iStep length of up, t_j+1Sample point x of_d(r_i,t_j+1) Deformed reference space point x_c(r_i,t_j+1) Input implicit symbolic distance field f_sThe resulting symbol distance value.

Further, the neural radiation field model for expressing colors in the reference space is a neural network model, and sequentially comprises: input layer, nonlinear layer, full connection layer and loss layer.

Further, the loss function in S10 may also contain a regularization term that may employ an Eikonal loss function that constrains an implicit symbolic distance field in the reference space that expresses the reference shape.

Further, the initial estimation of the shape and the posture of the frame-by-frame human body parametric fit model obtained in S2 can be optimized jointly with the dynamic human body reconstruction model as a learnable variable in the training process of S10.

According to a second aspect of the present specification, there is provided an apparatus for monocular human body reconstruction based on an IMU and a forward deformation field, comprising a memory having stored therein executable code, and one or more processors for implementing the method for monocular human body reconstruction based on an IMU and a forward deformation field according to the first aspect when executing the executable code.

The invention has the beneficial effects that: 1) through the establishment of an implicit symbolic distance field model expressing a reference shape in a reference space, a nerve radiation field model expressing colors and a volume rendering technology, the method can render to obtain a natural and real human video; 2) acceleration information between two adjacent frames is introduced through IMU equipment to be used as posture estimation of a parameterized model and direct constraint of a deformation field, and the method can more accurately model the deformation field so as to ensure that the reconstructed human body geometry is more accurate and the rendering effect is more natural and real; 3) because the forward deformation field can deform the reference space point of the human body model to the current space according to the attitude parameters of the parameterized model, the method can input new attitude parameters to drive the modeled human body to deform to a new attitude.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.

Fig. 1 is a flowchart of a monocular human reconstruction method based on an IMU and a forward deformation field according to an exemplary embodiment;

FIG. 2 is a schematic diagram illustrating an implementation principle of a monocular human body reconstruction method based on an IMU and a forward deformation field according to an exemplary embodiment;

fig. 3 is a block diagram of a monocular human body reconstruction method and apparatus based on an IMU and a forward deformation field according to an exemplary embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The human body modeling based on the monocular video has the problems that high-precision modeling is difficult to achieve due to the fact that information of a single visual angle is limited and a human body has dynamic deformation, and particularly when the motion change of the human body is large. Moreover, when the condition of human body self-occlusion exists, the monocular video has information loss, the inertial sensor is different from an optical camera and is not influenced by the occlusion, and the provided information is three-dimensional and is not two-dimensional information similar to a picture. Therefore, an embodiment of the present invention provides a monocular human body reconstruction method based on an IMU and a forward deformation field, as shown in fig. 1 and 2, which mainly includes the following steps:

s1: collecting a human motion monocular RGB video wearing an inertial sensor, segmenting a human body and a background of the human motion monocular RGB video frame by frame, recording a human body position bound by the inertial sensor and a derived acceleration signal, and recording an inertial sensor frame rate and a monocular RGB video shooting frame rate; the invention adopts a data set which is really acquired, uses 6 inertial sensors which are respectively bound on a wrist, an ankle, a head and a waist, and uses a monocular RGB camera to shoot a video of human motion.

S2: and fitting the human motion monocular RGB video frame by using a pre-trained human parametric fitting model to obtain the initial estimation of the shape and the posture of the human parametric fitting model frame by frame, marking a sensor label on a standard grid point corresponding to the human parametric fitting model, and indicating whether the point binds an inertial sensor and a bound sensor label.

S3: establishing a forward skin deformation field model from a point in a reference space to a point in a current space corresponding to the frame-by-frame human motion picture by adopting the learnable skin weight and combining an initial estimation result obtained by S2; wherein D of the forward skin deformation field model_wThe function is:

wherein x is_c(r_i,t_j) Representing a point in the reference space, x_d(r_i,t_j) Representing a sampling ray r in the current space corresponding to the frame-by-frame human motion picture_iStep length of up, t_jOf the sampling points of (a) are,

S4: establishing an implicit symbolic distance field model expressing a reference shape in a reference space by using a neural network; implicit symbolic distance field f_sThe function of (d) is:

f_s:x_c(r_i,t_j)→(s_ij,F_ij)

wherein s is_ijRepresenting the resulting expression in reference spaceSymbolic distance value of reference shape, F_ijIs a feature associated with the implicit symbolic distance field for establishing a connection in a reference space between the implicit symbolic distance field that expresses a reference shape and the neural radiation field that expresses a color;

specifically, an implicit symbolic distance field model that expresses a reference shape in a reference space employs a neural network model, which in turn comprises: input layer, nonlinear layer, full connection layer and loss layer. One frame can be selected from the initial estimation of the shape and the posture of the human body parametric fitting model frame by frame, and an implicit symbolic distance field expressing a reference shape in a reference space is initialized by using a standard grid corresponding to the frame.

S5: establishing a nerve radiation field model expressing colors in a reference space by utilizing a neural network; nerve radiation field f_cThe function of (d) is:

f_c:(x_c(r_i,t_j),r_i,F_ij)→c_ij

wherein, c_ijIs x_c(r_i,t_j) According to the discretized volume rendering formula, obtaining a sampling ray r_iCorresponding color value C (r)_i) Comprises the following steps:

wherein n is the number of sampling points on the sampling ray. Alpha is alpha_ijTransparency for the sample point correspondences:

wherein phi_m(x)＝(1+e^-mx)^-1Is a Sigmoid function, m is a predefined parameter, s_i(j+1)Is to sample the ray r_iStep length of up, t_j+1Sample point x of_d(r_i,t_j+1) Deformed reference space point x_c(r_i,t_j+1) Inputting implicit symbolic distance fieldsf_sObtaining a symbol distance value;

specifically, the neural radiation field model for expressing colors in the reference space adopts a neural network model, and sequentially comprises the following steps: input layer, nonlinear layer, full connection layer and loss layer.

S6: and sampling rays from a current space corresponding to the frame-by-frame human motion picture by adopting a volume rendering mode, and then sampling points along the rays.

S7: and deforming the sampling point into a reference space according to the forward skin deformation field, obtaining a sampling point symbol distance value according to the implicit symbol distance field, and obtaining a sampling point color value and transparency according to the nerve radiation field.

S8: and obtaining rendering color values corresponding to the frame-by-frame human motion picture according to the color values and the transparencies of all the sampling points along the ray direction.

S9: finding the closest point in the standard grid points corresponding to the human body parametric fitting model for each sampling point in S6, and transferring the sensor label of the standard grid point corresponding to the human body parametric fitting model to each sampling point; and marking the sampling points of the bound inertial sensor as key sampling points, firstly deforming the key sampling points into a reference space, then deforming the key sampling points into a current space corresponding to an adjacent frame through S3 to obtain new coordinates, and recording Euclidean distances between original coordinates and the new coordinates of the key sampling points.

S10: training a dynamic human body reconstruction model formed by the skin deformation field, the implicit symbol distance field and the nerve radiation field;

taking the difference between the rendering color value obtained in the step S8 and the color value of the corresponding point of the human body picture segmented in the step S1 as a loss function 1; obtaining a human body contour map according to the rendering color values obtained in the step S8, and taking the difference between the human body contour map obtained according to the human body picture divided in the step S1 and the human body contour map as a loss function 2; obtaining the acceleration of the key sampling point according to the Euclidean distance, the frame rate of the inertial sensor and the monocular RGB video shooting frame rate in S9, and taking the difference between the acceleration and the acceleration derived by the inertial sensor as a loss function 3; taking the weighted sum of the loss function 1, the loss function 2 and the loss function 3 as a trained loss function; in addition, the loss function can also contain a regularization term, which can employ an Eikonal loss function that constrains an implicit symbolic distance field in the reference space that expresses the reference shape; obtaining a reconstructed human body according to the trained dynamic human body reconstruction model;

further, the initial estimation of the shape and the posture of the frame-by-frame human body parametric fit model obtained in S2 can be used as a learnable variable in the training process of S10 to be optimized in combination with the dynamic human body reconstruction model.

Corresponding to the embodiment of the monocular human body reconstruction method based on the IMU and the forward deformation field, the invention also provides an embodiment of a monocular human body reconstruction device based on the IMU and the forward deformation field.

Referring to fig. 3, the monocular human body reconstructing device based on the IMU and the forward deformation field according to the embodiment of the present invention includes a memory and one or more processors, where the memory stores executable codes, and the processors execute the executable codes to implement the monocular human body reconstructing method based on the IMU and the forward deformation field according to the embodiment.

The embodiment of the monocular human body reconstruction device based on the IMU and the forward deformation field of the present invention can be applied to any device with data processing capability, such as a computer or other device. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for running through the processor of any device with data processing capability. From a hardware aspect, as shown in fig. 3, the present invention is a hardware structure diagram of any device with data processing capability in which a monocular human body reconstruction device based on an IMU and a forward deformation field is located, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 3, any device with data processing capability in which the device is located in the embodiment may also include other hardware according to the actual function of the any device with data processing capability, which is not described again.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.

An embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the monocular human body reconstruction method based on an IMU and a forward deformation field in the foregoing embodiments is implemented.

The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.

The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims

1. A monocular human body reconstruction method based on an IMU and a forward deformation field is characterized by comprising the following steps:

s1: collecting a human motion monocular RGB video wearing an inertial sensor, segmenting a human body and a background of the human motion monocular RGB video frame by frame, and recording the position of the human body bound by the inertial sensor;

s2: fitting the human motion monocular RGB video frame by using a pre-trained human parametric fitting model to obtain the initial estimation of the shape and the posture of the human parametric fitting model frame by frame, and marking a sensor label on a standard grid point corresponding to the human parametric fitting model;

s4: establishing an implicit symbolic distance field model that expresses a reference shape in a reference space;

s5: establishing a nerve radiation field model for expressing colors in a reference space;

s9: finding the closest point in the standard grid points corresponding to the human body parameterized fitting model for each sampling point in S6, and transferring the sensor labels of the standard grid points to the sampling points; recording sampling points of the bound inertial sensor as key sampling points, firstly deforming to a reference space, then deforming to a current space corresponding to an adjacent frame through S3 to obtain new coordinates, and recording Euclidean distances between original coordinates and the new coordinates of the key sampling points;

s10: training a dynamic human body reconstruction model formed by the skin deformation field, the implicit symbol distance field and the nerve radiation field, and obtaining a reconstructed human body according to the trained dynamic human body reconstruction model;

2. The monocular human body reconstruction method based on the IMU and the forward deformation field of claim 1, wherein the forward skin deformation field model D in S3_wThe function of (d) is:

wherein x is_c(r_i,t_j) Representing points in reference space, x_d(r_i,t_j) Representing a sampling ray r in the current space corresponding to the frame-by-frame human motion picture_iStep length of up, t_jOf the sampling points of (a) are,

transformation matrix, n, representing bones in a parameterized fitted model of the human body_bThe number of bones; the skin deformation formula is:

wherein, w_kIs a learnable skinning weight.

3. The method of claim 2, wherein the implicit symbolic distance field f is the distance field of S4_sThe function of (d) is:

f_s:x_c(r_i,t_j)→(s_ij,F_ij)

4. The method of claim 1, wherein the step S4 is implemented by selecting a frame from initial estimates of the shape and pose of the frame-by-frame parametric fit model of the human body, and initializing an implicit symbolic distance field in a reference space that expresses a reference shape with a standard grid corresponding to the frame.

5. The method of claim 3, wherein the nerve radiation field f in S5 is a single-eye human body reconstruction method based on IMU and forward deformation field_cThe function of (d) is:

f_c:(x_c(r_i,t_j),r_i,F_ij)→c_ij

6. The method of claim 1, wherein the implicit symbolic distance field model is a neural network model that, in turn, comprises: the device comprises an input layer, a nonlinear layer, a full connection layer and a loss layer;

the nerve radiation field model is a neural network model and sequentially comprises: input layer, nonlinear layer, full connection layer and loss layer.

7. The method for monocular human body reconstruction based on IMU and forward deformation field according to any one of claims 1 to 6, wherein in S10, the difference between the rendering color value obtained in S8 and the color value of the corresponding point of the human body picture segmented in S1 is used as a loss function 1;

obtaining a human body contour map according to the rendering color values obtained in the step S8, and taking the difference between the human body contour map obtained according to the human body picture divided in the step S1 and the human body contour map as a loss function 2;

obtaining the acceleration of a key sampling point according to the Euclidean distance, the frame rate of the inertial sensor and the video shooting frame rate in S9, and taking the difference between the acceleration and the acceleration derived by the inertial sensor as a loss function 3;

the weighted sum of the three loss functions is taken as the trained loss function.

8. The method of claim 1, wherein the loss function in S10 further comprises an Eikonal loss function that constrains an implicit symbolic distance field in the reference space that represents the reference shape.

9. The IMU and forward deformation field-based monocular human reconstruction method of claim 1, wherein the initial estimation of the shape and pose of the frame-by-frame human parametric fit model obtained in S2 is jointly optimized with the dynamic human reconstruction model as a learnable variable in the training process of S10.

10. An apparatus for IMU and forward deformation field based monocular human reconstruction, comprising a memory having stored therein executable code and one or more processors, wherein the processors, when executing the executable code, are configured to implement the IMU and forward deformation field based monocular human reconstruction method according to any one of claims 1-9.