CN116452715A - Dynamic human hand rendering method, device and storage medium - Google Patents
Dynamic human hand rendering method, device and storage medium
- Publication number
- CN116452715A (application CN202310256394.6A)
- Authority
- CN
- China
- Prior art keywords
- human hand
- rendering
- sampling points
- space
- sampling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/04—Texture mapping
Abstract
The embodiment of the invention discloses a dynamic human hand rendering method, an apparatus, an electronic device and a storage medium. The method comprises the following steps: acquiring a human hand image sequence; transforming the sampling points of the hand images in the observation space from the observation space to a standard space through a pose-dependent motion field distribution, to obtain the coordinates of the sampling points in the standard space; learning an identity-dependent neural radiance field in the standard space according to those coordinates, to obtain the volume color and volume density of the sampling points; and rendering the volume color and volume density of the sampling points to obtain a rendering result of the human hand, wherein the rendering result comprises a rendered image of the human hand and/or a three-dimensional model of the human hand. The invention solves the problem that dynamic human hand rendering results in the related art are not sufficiently lifelike.
Description
Technical Field
The present invention relates to the field of three-dimensional reconstruction, and in particular to a dynamic human hand rendering method and apparatus, and a storage medium.
Background
Rendering techniques for dynamic human hands have long been a research hotspot in computer vision and computer graphics, with important applications in fields such as human-computer interaction, virtual reality, mixed reality, and holographic communication.
In the related art, traditional rendering methods generally build three-dimensional hand models and two-dimensional hand images by using texture maps and colored meshes for hand texture rendering and geometric modeling. However, producing well-designed, personalized colored meshes and texture maps usually requires expensive scan data and professional expertise, and limited texture maps and colored meshes can hardly guarantee lifelike dynamic hand rendering results. Meanwhile, because the human hand is a highly articulated structure, articulated hand motion causes severe self-occlusion, and different hand poses produce significant changes in illumination and shadow patterns, so the resulting dynamic hand rendering is not lifelike.
Therefore, there is an urgent need for a dynamic human hand rendering method capable of improving the fidelity of the dynamic human hand rendering result.
Disclosure of Invention
The embodiments of the invention provide a dynamic human hand rendering method, an apparatus, an electronic device and a storage medium, to solve the problem that dynamic human hand rendering results in the related art are not sufficiently realistic.
In order to solve the technical problems, the invention adopts the following technical scheme:
According to one aspect of the invention, a dynamic human hand rendering method comprises: acquiring a human hand image sequence, wherein the sequence comprises multiple frames of hand images and each frame is obtained by photographing the human hand from a different viewing angle; transforming the sampling points of the hand images in the observation space from the observation space to a standard space through a pose-dependent motion field distribution, to obtain the coordinates of the sampling points in the standard space; learning an identity-dependent neural radiance field in the standard space according to those coordinates, to obtain the volume color and volume density of the sampling points; and rendering the volume color and volume density of the sampling points to obtain a rendering result of the human hand, wherein the rendering result comprises a rendered image of the human hand and/or a three-dimensional model of the human hand.
In an exemplary embodiment, transforming the sampling points of the hand image in the observation space from the observation space to the standard space through the pose-dependent motion field distribution, to obtain the coordinates of the sampling points in the standard space, comprises: estimating the pose and shape of each frame of the hand image through a hand parameterized model, to obtain pose parameters and shape parameters of each frame; learning the pose-dependent motion field distribution in the observation space under the guidance of the pose parameters and the shape parameters; and applying rigid transformation and non-rigid offset compensation, respectively, to the sampling points of the hand image in the observation space using the pose-dependent motion field distribution, to obtain the coordinates of the sampling points in the standard space.
In an exemplary embodiment, rigidly transforming the sampling points of the hand image in the observation space using the pose-dependent motion field distribution comprises: for a sampling point of the hand image in the observation space, searching among the vertices of the hand parameterized model for the several vertices closest to the sampling point, and using the distances between the found vertices and the sampling point as weights; weighting and summing preset rendering transformation matrices based on the weights, to obtain a rigid transformation matrix that transforms the sampling point from the observation space to the standard space; and rigidly transforming the sampling point according to the rigid transformation matrix.
In an exemplary embodiment, rendering the volume color and volume density of the sampling points to obtain the rendering result of the human hand comprises: volume rendering the volume color and volume density of the sampling points to obtain the rendered image of the human hand, and/or neurally rendering the volume density of the sampling points to obtain the three-dimensional model of the human hand. Neurally rendering the volume density of the sampling points to obtain the three-dimensional model of the human hand comprises: extracting a three-dimensional mesh model from the neural radiance field through the marching cubes algorithm; and differentiably rendering the volume density of the sampling points based on the three-dimensional mesh model, to obtain the three-dimensional model of the human hand.
In an exemplary embodiment, transforming the sampling points of the hand image from the observation space to the standard space through the pose-dependent motion field distribution comprises: in the case that the rendering camera emits rays from the pixel plane of the hand image into the observation space, sampling each ray with a set step length to obtain the density distribution of the ray; and densely sampling rays where the density is high and sparsely sampling rays where the density is low, to obtain a plurality of sampling points.
In an exemplary embodiment, after sampling each ray with the set step length and obtaining the density distribution of the ray, the method comprises: searching for sampling points whose distance to the vertices of the corresponding hand parameterized model is greater than a set threshold; setting the density of the found sampling points to a set density; and updating the density distribution of the ray based on the sampling points after the density has been set.
In one exemplary embodiment, the dynamic human hand rendering is accomplished by invoking a dynamic human hand rendering model, which is a trained machine learning model having dynamic human hand rendering capability for hand images.
In one exemplary embodiment, the training process of the dynamic human hand rendering model comprises: obtaining a training set; inputting training images in the training set into the machine learning model for training, to obtain a loss value between the rendering result generated during training and the training images; if the loss value satisfies the convergence condition, training is complete and the dynamic human hand rendering model is obtained; otherwise, updating the model parameters of the machine learning model, obtaining other training images from the training set, inputting them into the machine learning model, and continuing training until the loss value satisfies the convergence condition.
According to one aspect of the invention, a dynamic human hand rendering device comprises: an image sequence acquisition module, configured to acquire a human hand image sequence, wherein the sequence comprises multiple frames of hand images and each frame is obtained by photographing the human hand from a different viewing angle; a space transformation module, configured to transform the sampling points of the hand images in the observation space from the observation space to the standard space through the pose-dependent motion field distribution, to obtain the coordinates of the sampling points in the standard space; a neural radiance field learning module, configured to learn the identity-dependent neural radiance field in the standard space according to the coordinates of the sampling points in the standard space, to obtain the volume color and volume density of the sampling points; and a human hand rendering module, configured to render the volume color and volume density of the sampling points to obtain the rendering result of the human hand, wherein the rendering result comprises the rendered image of the human hand and/or the three-dimensional model of the human hand.
According to one aspect of the invention, an electronic device includes a processor and a memory having stored thereon computer readable instructions that when executed by the processor implement a dynamic human hand rendering method as described above.
According to one aspect of the invention, a storage medium has stored thereon a computer program which, when executed by a processor, implements a dynamic human hand rendering method as described above.
With the above technical solution, the invention solves the problems that the related art cannot dynamically represent changes in hand pose and shape, cannot control the pose of the hand, and cannot render its shape with high fidelity.
Specifically, a human hand image sequence is first obtained, comprising multiple frames of hand images, each frame photographed from a different viewing angle. The sampling points of the hand images in the observation space are transformed from the observation space to the standard space through the pose-dependent motion field distribution, yielding the coordinates of the sampling points in the standard space. A neural radiance field in the standard space is then learned from those coordinates, yielding the volume color and volume density of the sampling points; volume rendering the volume color and volume density produces the rendered image of the human hand, and differentiably rendering the volume density produces the three-dimensional model of the human hand. In this process, the hand is inversely transformed, in a multi-view setting, from the observation space to the standard space with pose dependency, based on the pose-dependent motion field distribution and the neural radiance field. The neural radiance field thus learns a learnable implicit function representation in the standard space, through which high-frequency texture details are rendered and a more accurate texture reconstruction is obtained. Rendering the neural radiance field with the volume rendering method produces synthesized hand images at arbitrary viewing angles, with a pronounced effect on novel-view rendering, so the realism of dynamic hand rendering results can be markedly improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic illustration of an implementation environment in accordance with the teachings of the present application;
FIG. 2 is a flow chart illustrating a method of dynamic human hand rendering according to an exemplary embodiment;
FIG. 3 is a schematic diagram of a non-rigid transformation network in accordance with the embodiment of FIG. 2;
FIG. 4 is a schematic diagram of a dynamic human hand rendering method in the corresponding embodiment of FIG. 2;
FIG. 5 is a schematic diagram of parameter estimation in step 130 in the corresponding embodiment of FIG. 2;
FIG. 6 is a schematic diagram of a network structure of the neural radiation field in the corresponding embodiment of FIG. 2;
FIG. 7 is a schematic diagram illustrating the effect of different standard pose settings in a standard space on dynamic human hand rendering, according to an example embodiment;
FIG. 8 is a schematic diagram of a three-dimensional reconstruction model of a human hand at step 170 in the corresponding embodiment of FIG. 2;
FIG. 9 is a schematic diagram of the result of dynamic human hand rendering of the corresponding embodiment of FIG. 2;
FIG. 10 is a block diagram of a dynamic human hand rendering device shown according to an example embodiment;
FIG. 11 is a hardware configuration diagram of an electronic device shown according to an exemplary embodiment;
fig. 12 is a block diagram of an electronic device, according to an example embodiment.
Specific embodiments of the invention have been shown in the drawings and are described hereinafter, with the understanding that the present disclosure is to be considered in all respects as illustrative and not restrictive, the scope of the inventive concepts being indicated by the appended claims.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The following is an introduction and explanation of several terms involved in this application:
Human hand parameterized model: a human hand can be understood as a base template plus deformations applied to it. PCA on these deformations yields a shape parameter (shape), a low-dimensional parameter describing the hand shape, while a kinematic tree represents the hand pose, i.e., the rotation of each joint relative to its parent node; each such relation can be expressed as a three-dimensional vector, and the local rotation vectors of all joints together form the pose parameter (pose) of the hand parameterized model. The model is obtained through training; the pose parameter has 48 values representing the rotation angles of 16 joints, and the shape parameter has 10 values representing, for example, the length and thickness ratios of the fingers. In general, a reasonable combination of these parameters controls the change of the hand shape.
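For concreteness, the following is a minimal sketch of the parameter layout just described (48 pose values for 16 joints, 10 shape values). The `HandModel` class, its random template, and the omission of skinning are illustrative stand-ins, not the actual hand parameterized model:

```python
import numpy as np

N_JOINTS, POSE_DIM, SHAPE_DIM = 16, 48, 10  # 16 joints x 3 axis-angle values

class HandModel:
    """Toy linear hand model: template vertices + PCA shape blend."""
    def __init__(self, n_vertices: int = 778):
        rng = np.random.default_rng(0)
        self.template = rng.normal(size=(n_vertices, 3))               # base mesh
        self.shape_dirs = rng.normal(size=(SHAPE_DIM, n_vertices, 3))  # PCA directions

    def vertices(self, pose: np.ndarray, shape: np.ndarray) -> np.ndarray:
        assert pose.shape == (POSE_DIM,) and shape.shape == (SHAPE_DIM,)
        v = self.template + np.einsum("s,svc->vc", shape, self.shape_dirs)
        # A real model would additionally apply linear blend skinning driven
        # by `pose` over the kinematic tree.
        return v

hand = HandModel()
verts = hand.vertices(np.zeros(POSE_DIM), np.zeros(SHAPE_DIM))
print(verts.shape)  # (778, 3)
```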
Neural radiance field: typically used in conjunction with a deep learning framework. Differentiable rendering of the neural radiance field makes the whole rendering process differentiable, so gradients can be back-propagated and the deep learning network parameters continuously updated, allowing an end-to-end network model to be built. Moreover, for an input image at a given viewing angle, the two-dimensional information obtained by differentiably rendering the three-dimensional geometry and the input image at that viewing angle form a self-supervised loop, without additional expensive supervision.
Currently, neural radiance fields are widely used to represent dynamic human body scenes, predicting the geometry and texture mapping of the human body. However, existing explorations of hand representation based on neural radiance fields either cannot render the hand at all or produce blurred pixels in the hand region, so the resulting hands are not realistic enough.
The first reason is that the human hand is highly articulated, and complex hand motion makes rendering difficult. First, the deformation of hand geometry is hard to model: when dealing with large and complex hand deformations (e.g., self-contact), accurate hand pose and shape parameters are difficult to obtain. Second, hand texture is hard to model because of the highly articulated structure: articulated hand motion causes severe self-occlusion, so different hand poses produce significant changes in illumination and shadow patterns, yielding less realistic rendering results.
A hand parameterized model is usually used to represent variations of hand pose and shape. However, such a model can only represent 778 vertices and 1538 faces, so its expressive power is extremely limited, and the mesh suffers from discontinuity and a fixed topology. Other prior art cannot handle self-contact, easily causes ambiguity in the space conversion, and breaks end-to-end network training, which poses great problems for accurate hand pose rendering.
In summary, the related art still suffers from the defect that dynamic human hand rendering results are not realistic enough.
In the dynamic hand rendering method of this application, a human hand image sequence is first obtained, comprising multiple frames of hand images, each frame photographed from a different viewing angle. Based on the pose-dependent motion field distribution, the sampling points of the hand images in the observation space are transformed from the observation space to the standard space, yielding the coordinates of the sampling points in the standard space. The neural radiance field in the standard space is learned from those coordinates to obtain the volume color and volume density of the sampling points; volume rendering them produces the rendered image of the hand, and differentiably rendering the volume density produces the three-dimensional model of the hand. By combining the pose-dependent motion field distribution with the neural radiance field in a multi-view setting, the hand is inversely transformed from the observation space to the standard space with pose dependency, and the neural radiance field learns a learnable implicit function representation in the standard space, through which high-frequency texture details are rendered and a more accurate texture reconstruction is obtained. Rendering the neural radiance field with the volume rendering method produces synthesized hand images at arbitrary viewing angles, with a pronounced effect on novel-view rendering, so the realism of dynamic hand rendering results can be markedly improved. The dynamic hand rendering method is suited to a dynamic hand rendering device deployable on an electronic device, where the electronic device may be a computer device with a von Neumann architecture, for example a desktop computer, a notebook computer, or a server, or a device with an image acquisition function.
FIG. 1 is a schematic diagram of an implementation environment of a dynamic human hand rendering method. The implementation environment includes an acquisition side 110 and a server side 130.
Specifically, the acquisition end 110 captures hand images; it may be an electronic device such as a smart phone, a tablet computer, a notebook computer, or a desktop computer, or another device with an image acquisition function (such as a smart camera), which is not limited herein.
The collection end 110 and the server end 130 can be connected through communication established in a wired or wireless mode, so that data transmission between the collection end and the server end is achieved. For example, the data transmitted may be an image of a human hand or the like.
The server 130 may also be regarded as a cloud, a cloud platform, a server, etc. The server 130 may be a single server, a server cluster formed by multiple servers, or a cloud computing center formed by multiple servers, so as to better provide background services to the acquisition end 110. For example, the background services include a dynamic human hand rendering service.
Regarding the interaction between the acquisition end 110 and the server end 130, in one application scenario where the server end 130 provides the dynamic hand rendering service, the acquisition end 110 captures a hand image and sends it to the server end 130; the server end 130 then receives the hand image sent by the acquisition end 110 and provides the dynamic hand rendering service based on it. Specifically, after the server 130 obtains the hand image, it transforms the sampling points of the hand image in the observation space from the observation space to the standard space based on the pose-dependent motion field distribution, to obtain the sampling point coordinates in the standard space; learns the neural radiance field in the standard space from those coordinates, to obtain the volume color and volume density of the sampling points; volume renders the volume color and volume density to obtain the rendered image of the hand; and differentiably renders the volume density to obtain the three-dimensional model of the hand.
Of course, in another application scenario, the acquisition end 110 may also implement the acquisition of the hand image and the dynamic hand rendering at the same time, which is not limited herein.
Referring to fig. 2, an embodiment of the present application provides a dynamic human hand rendering method, which is suitable for an electronic device, for example, the electronic device may be a desktop computer, a notebook computer, a server, or the like, or may be an electronic device with an image capturing function.
In the following method embodiments, for convenience of description, the execution subject of each step of the method is described as an electronic device, but this configuration is not particularly limited.
As shown in fig. 2, the method may include the steps of:
Step 110: obtain a human hand image sequence, wherein the sequence comprises multiple frames of hand images and each frame is obtained by photographing the human hand from a different viewing angle.
The hand image can be captured and collected by the acquisition end. The acquisition end may be an electronic device with an image capture function, for example a camera or a smart phone equipped with a camera. It can be understood that the capture may be a single shot, yielding one photo, or repeated shots, yielding a video; that is, the hand image may be any frame of a video or a single photo.
Regarding the acquisition of the image, the image may come from images captured by the acquisition end in real time, or from images captured by the acquisition end during a historical period and stored in advance in the electronic device. Accordingly, after the acquisition end captures the image, the electronic device may process it in real time, or store it and process it later, for example when the CPU load of the electronic device is low, or according to an operator's instruction. Thus, the dynamic human hand rendering in this embodiment may be performed on images acquired in real time or on images acquired during a historical period, which is not specifically limited herein.
Step 130: based on the pose-dependent motion field distribution, transform the sampling points of the hand image in the observation space from the observation space to the standard space, to obtain the coordinates of the sampling points in the standard space.
The pose-dependent motion field distribution consists of the rigid transformation caused by skeletal motion and the non-rigid offset compensation superimposed on it. The embodiment of the invention decomposes the problem of complex hand motion into two parts: rigid deformation produced by skeletal motion and non-rigid deformation. The rigid deformation caused by skeletal motion refers to the pose change produced by hand movement, such as bending or curling of the hand, while the non-rigid deformation refers to the shape and texture changes induced by the rigid deformation, such as the wrinkles it produces.
Through the above process, the embodiment of the invention obtains a pose-dependent motion field distribution that accurately expresses complex hand motion, and can therefore accurately transform the sampling points of the hand image from the observation space to the standard space, establish an accurate mapping between the two spaces, reduce the dependence of dynamic hand rendering on different input poses, and improve the fidelity of the dynamic hand rendering result.
In one possible implementation, as shown in fig. 3, the network structure of the non-rigid transformation MLP module in the dynamic human hand rendering model uses a 7-layer MLP (width 128) that takes as input the coordinates of the sampling points of the hand image in the observation space and the pose parameters of the corresponding hand parameterized model, and outputs the non-rigid offset compensation. Combined with the multi-layer perceptron, this module efficiently and accurately derives the non-rigid offset compensation corresponding to a pose change, thereby efficiently obtaining an accurate pose-dependent motion field distribution for the space conversion.
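A minimal sketch of such a module follows, assuming a plain ReLU MLP; the activation functions, absence of positional encoding, and layer naming are illustrative rather than taken from the patent:

```python
import torch
import torch.nn as nn

class NonRigidMLP(nn.Module):
    """7-layer MLP, width 128: (warped point, pose) -> non-rigid offset ΔX."""
    def __init__(self, pose_dim: int = 48, width: int = 128, depth: int = 7):
        super().__init__()
        layers, in_dim = [], 3 + pose_dim
        for _ in range(depth - 1):
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        layers += [nn.Linear(in_dim, 3)]  # 3D offset compensation
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # x: (N, 3) sampling points; pose: (48,) frame pose parameters
        return self.net(torch.cat([x, pose.expand(x.shape[0], -1)], dim=-1))

mlp = NonRigidMLP()
offset = mlp(torch.randn(1024, 3), torch.zeros(48))  # ΔX per sampling point
```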
Fig. 4 shows a schematic diagram of the dynamic human hand rendering method; it can be seen that the pose-dependent motion field distribution comprises the rigid transformation caused by skeletal motion and the non-rigid transformation implemented by the multi-layer perceptron (MLP).
The spatial transformation process in an embodiment of the present invention is described in detail below with reference to fig. 4:
In one possible implementation, step 130 may be implemented through the following steps:
Step S1: estimate the pose and shape of each frame of the hand image through the hand parameterized model, to obtain the pose parameters and shape parameters of each frame in the observation space.
In one possible implementation, before the sampling points of the hand image in the observation space are transformed from the observation space to the standard space, the embodiment of the invention first preprocesses the obtained hand image sequence. As shown in fig. 5, a foreground hand mask is first extracted for each frame to remove the background, and then the hand pose parameters and shape parameters of each frame are obtained through a pre-trained hand parameterized model estimator; neither the foreground hand mask extraction algorithm nor the hand parameterized model is limited here.
Through this process, the embodiment of the invention takes the explicit pose parameters and shape parameters of the hand parameterized model as explicit driving parameters, encodes the pose into the pose-dependent non-rigid deformation, encodes the shape into the neural radiance field, and learns hand data of different shapes under a unified framework, thereby realizing dynamic hand rendering with pose and shape decoupled.
Step S2: learn the pose-dependent motion field distribution from the pose parameters and shape parameters, where the pose-dependent motion field distribution comprises the rigid transformation and the non-rigid transformation.
Specifically, the formula of the gesture-dependent motion field distribution is as follows:
$X_c = T_{\text{rigid}}(X_o, \theta) + \Delta X$

where $\theta$ is the hand pose parameter of the current frame, $X_o$ are the coordinates of the ray sampling point in the observation space, $X_c$ are the coordinates of the sampling point mapped into the standard space, $T_{\text{rigid}}$ is the pose-dependent rigid transformation function, and $\Delta X$ is the corresponding non-rigid offset compensation.
Step S3: obtain the rigidly transformed sampling point coordinates based on the hand parameterized model, according to the sampling point coordinates of the hand image in the observation space and the pose and shape parameters.
In one possible implementation, the rigid transformation in the embodiment of the invention is implemented as follows: first, based on the coordinates of a sampling point of the hand image in the observation space, search for the several vertices of the hand parameterized model closest to the sampling point, using the distances between these vertices and the sampling point as weights; then, compute a weighted sum of the rendering transformation matrices preset by the hand parameterized model based on these weights, to obtain the rigid transformation matrix that transforms the sampling point from the observation space to the standard space; finally, rigidly transform the sampling point coordinates according to the rigid transformation matrix, to obtain the rigidly transformed sampling point coordinates.
Searching for the several vertices closest to the sampling point among the vertices of the hand parameterized model may be implemented by a nearest neighbor algorithm, for example the K-nearest-neighbor (KNN) algorithm, which is not limited here. A sketch of this step is given below.
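The following sketch illustrates this rigid warp under stated assumptions: the value of K, the inverse-distance weighting, and the per-vertex 4×4 transformation matrices are illustrative choices, since the text only specifies nearest-vertex search with distance-based weights:

```python
import torch

def rigid_transform(x_obs, verts, vert_transforms, k=4, eps=1e-8):
    """Warp observation-space points to the standard space.

    x_obs:           (N, 3) sampling points in the observation space
    verts:           (V, 3) hand-model vertices in the observation space
    vert_transforms: (V, 4, 4) per-vertex observation->standard transforms
    """
    d = torch.cdist(x_obs, verts)               # (N, V) pairwise distances
    dist, idx = d.topk(k, largest=False)        # K nearest vertices per point
    w = 1.0 / (dist + eps)                      # closer vertex -> larger weight
    w = w / w.sum(dim=-1, keepdim=True)         # normalize weights
    T = (w[..., None, None] * vert_transforms[idx]).sum(dim=1)  # (N, 4, 4)
    x_h = torch.cat([x_obs, torch.ones_like(x_obs[:, :1])], dim=-1)
    return torch.einsum("nij,nj->ni", T, x_h)[:, :3]  # rigidly warped points
```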
Step S4: perform feature synthesis from the rigidly transformed sampling point coordinates and the pose parameters to realize the non-rigid transformation, obtaining the sampling point coordinates in the standard space.
Specifically, the formula for the non-rigid deformation is as follows:
$\Delta X = T_{\text{non-rigid}}(T_{\text{rigid}}(X_o, \theta))$

where $\theta$ is the hand pose parameter of the current frame, $X_o$ are the coordinates of the ray sampling point in the observation space, $T_{\text{non-rigid}}$ is the non-rigid transformation function, and $\Delta X$ is the corresponding non-rigid offset compensation.
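Combining the two stages, a minimal sketch of the full observation-to-standard warp, reusing the illustrative `rigid_transform` and `NonRigidMLP` helpers sketched above:

```python
def warp_to_standard(x_obs, pose, verts, vert_transforms, non_rigid_mlp):
    x_rigid = rigid_transform(x_obs, verts, vert_transforms)
    delta_x = non_rigid_mlp(x_rigid, pose)  # ΔX = T_non-rigid(T_rigid(X_o, θ))
    return x_rigid + delta_x                # X_c, standard-space coordinates
```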
In this process, by modeling the hand deformation as the rigid deformation caused by skeletal motion plus the non-rigid offset compensation superimposed on it, the embodiment of the invention makes full use of the geometric prior of the three-dimensional hand model and simplifies the learning task of the neural network. The articulated hand parameterized model allows an explicit conversion of space points from the observation space to the standard space: the dynamic hand is represented as an appearance volume implicit function in a standard space that is transformed into the observation space. A pose-dependent motion field distribution is learned in the observation space under the guidance of the hand pose parameters and shape parameters, inversely mapping the sampling points in the observation space to their accurate positions in the standard space. This helps learn a designated, meaningful standard space and reduces the dependence on different input poses, so that the method generalizes to new rendering poses, enabling the neural radiance field to be learned from a dynamic hand scene and rendered after training.
Step 150: learn the identity-dependent neural radiance field in the standard space according to the coordinates of the sampling points in the standard space, to obtain the volume color and volume density of the sampling points.
The identity-dependent neural radiance field extends rendering and expression by introducing shape deformation, using the shape parameters to express identity. It is capable of modeling high-frequency textures, performs better in view consistency, can produce photorealistic dynamic hand rendering, and reconstructs a three-dimensional hand model with high-quality details.
Referring back to fig. 4, after the coordinates of the sampling points in the standard space are obtained, they are fed into the identity-dependent neural radiance field to obtain the volume color and volume density of the sampling points.
In one possible implementation, as shown in fig. 6, the network structure of the identity-dependent neural radiance field in the dynamic human hand rendering model uses an 8-layer MLP that takes as input the coordinates of the sampling points in the standard space and the shape parameters of the hand parameterized model, and outputs the volume color and volume density of the sampling points.
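A minimal sketch of such a field follows, assuming a plain ReLU trunk with separate color and density heads; the width, activations, and output heads are illustrative, as the text only specifies an 8-layer MLP over standard-space coordinates and shape parameters:

```python
import torch
import torch.nn as nn

class IdentityNeRF(nn.Module):
    """8-layer MLP: (standard-space point, shape) -> (volume color, density)."""
    def __init__(self, shape_dim: int = 10, width: int = 256, depth: int = 8):
        super().__init__()
        layers, in_dim = [], 3 + shape_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        self.trunk = nn.Sequential(*layers)
        self.color_head = nn.Linear(width, 3)   # volume color c
        self.sigma_head = nn.Linear(width, 1)   # volume density σ

    def forward(self, x_c: torch.Tensor, shape: torch.Tensor):
        h = self.trunk(torch.cat([x_c, shape.expand(x_c.shape[0], -1)], dim=-1))
        return torch.sigmoid(self.color_head(h)), torch.relu(self.sigma_head(h))

field = IdentityNeRF()
color, sigma = field(torch.randn(1024, 3), torch.zeros(10))
```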
Under the effect of this embodiment, the invention optimizes and constructs a non-rigid transformation neural network and a drivable neural radiance field from a segment of input multi-view video; the constructed non-rigid transformation MLP module and identity-dependent neural radiance field can then be used to render dynamic hands, markedly improving the fidelity of the dynamic hand rendering result.
The specific calculation formula is as follows:
$\alpha_i = 1 - \exp(-\sigma(x_i)\,\Delta t_i)$

$C = \sum_{i=1}^{D} \Big( \prod_{j=1}^{i-1} (1 - \alpha_j) \Big)\, \alpha_i\, c(x_i)$

where $D$ is the number of sampling steps, $\Delta t_i$ is the distance between sampling points $i$ and $i+1$, $c(x_i)$ and $\sigma(x_i)$ are the volume color and volume density of sampling point $x_i$, and $C$ is the accumulated volume color along the ray.
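A minimal sketch of this compositing step, assuming the per-ray samples are already ordered front to back:

```python
import torch

def volume_render(colors, sigmas, deltas):
    """colors: (D, 3), sigmas: (D,), deltas: (D,) distances Δt_i."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)                  # α_i
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0
    )                                                           # Π_{j<i}(1-α_j)
    weights = trans * alphas
    return (weights[:, None] * colors).sum(dim=0)               # ray color C
```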
It is worth mentioning that, as shown in fig. 7, a qualitative experiment examines how different standard poses set in the standard space affect the dynamic hand rendering result. The standard pose of pose 3 renders better than poses 1 and 2: in pose 3 the spacing between the knuckles is larger, so when the sampling points are inversely converted back to the standard space, accurate conversion points are easier to obtain and the semantic distribution in the standard space is more accurate, reducing artifacts. In pose 2, by contrast, the learned conversion matrix exceeds the space that the hand parameterized model can express because of unreasonable deformation, introducing larger unreasonable deformations at the joints and degrading the rendering effect.
Based on the above, the conclusion of the qualitative experiment helps learn a designated, meaningful standard space: setting a reasonable standard pose in the standard space reduces the dependence on different input poses, improves the accuracy of the space conversion, and further improves the dynamic hand rendering effect.
Step 170: render the volume color and volume density of the sampling points to obtain the rendering result of the human hand.
Wherein the rendering result comprises a rendered image of the human hand and/or a three-dimensional model of the human hand.
In one possible implementation, step 170 may include the following steps: volume render the volume color and volume density of the sampling points to obtain the rendered image of the human hand, and/or neurally render the volume density of the sampling points to obtain the three-dimensional model of the human hand.
Regarding neural rendering of the three-dimensional hand model, specifically, the embodiment of the invention extracts a three-dimensional mesh model from the neural radiance field through the marching cubes algorithm, and differentiably renders the volume density of the sampling points based on the three-dimensional mesh model to obtain the three-dimensional model of the human hand.
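A minimal sketch of the mesh extraction step, using the marching cubes implementation from scikit-image; the grid resolution, bounds, and iso-level are illustrative assumptions:

```python
import torch
from skimage import measure

def extract_mesh(field, shape, res=64, bound=1.0, level=None):
    """Evaluate the density of `field` on a regular grid and run marching cubes."""
    xs = torch.linspace(-bound, bound, res)
    grid = torch.stack(torch.meshgrid(xs, xs, xs, indexing="ij"), dim=-1)
    with torch.no_grad():
        _, sigma = field(grid.reshape(-1, 3), shape)   # density per grid point
    sigma = sigma.reshape(res, res, res).numpy()
    # level=None lets skimage pick the mid-range iso-surface value.
    verts, faces, _, _ = measure.marching_cubes(sigma, level=level)
    return verts, faces
```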
Through the above process, the invention, in combination with the hand parameterized model, models hand motion as the rigid deformation brought by skeletal motion plus the non-rigid offset compensation superimposed on it. Building on the strength of neural rendering methods at capturing high-frequency textures, it addresses dynamic rendering of real hands: the hand parameterized model introduces a rigid-transformation prior for hand motion, and a pose-dependent neural network models the non-rigidity of the hand, so that a pose-dependent neural radiance field is optimized in a temporally unified standard space, achieving a lifelike dynamic hand rendering effect.
In an exemplary embodiment, the dynamic human hand rendering in the embodiment of the invention is realized by invoking a dynamic human hand rendering model. The model comprises the non-rigid transformation MLP module, which non-rigidly transforms the coordinates of the sampling points of the hand image, and the identity-dependent neural radiance field, which produces the volume color and volume density of the sampling points.
In an exemplary embodiment, training the dynamic human hand rendering model may include the steps of:
First, a training set is acquired.
The training images in the training set are human hand image sequences comprising multiple frames of hand images, each frame obtained by photographing the human hand from a different viewing angle.
In one possible implementation, to facilitate computing the loss value, the invention samples batches of picture regions: in each training batch, G picture regions of size H×H are sampled and G×H×H rays are rendered, with the rendered ray positions consistent with the sampled picture pixel positions.
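A minimal sketch of this patch sampling scheme; the values of G and H and the image layout are illustrative:

```python
import torch

def sample_patches(image, G=4, H=32):
    """image: (3, height, width). Returns patch colors and their (y, x) coords."""
    _, height, width = image.shape
    patches, coords = [], []
    for _ in range(G):
        y0 = torch.randint(0, height - H, (1,)).item()
        x0 = torch.randint(0, width - H, (1,)).item()
        patches.append(image[:, y0:y0 + H, x0:x0 + H])
        ys, xs = torch.meshgrid(
            torch.arange(y0, y0 + H), torch.arange(x0, x0 + H), indexing="ij"
        )
        coords.append(torch.stack([ys, xs], dim=-1))
    # One ray is cast per returned pixel coordinate (G*H*H rays per batch).
    return torch.stack(patches), torch.stack(coords)
```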
Second, input the training images in the training set into the machine learning model for training, to obtain the loss value between the rendering result generated during training and the training images.
In one possible implementation, the loss function calculation formula is as follows:
$L = L_{\text{LPIPS}} + \lambda L_{\text{MSE}}$

where $L_{\text{MSE}}$ is the mean square error between the rendered image output for each frame and the corresponding input training image, and $L_{\text{LPIPS}}$ is a perceptual loss function that uses a VGG network as reference.
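A minimal sketch of this loss, assuming the published `lpips` package with a VGG backbone (which expects inputs scaled to [-1, 1]); λ is a tunable weight:

```python
import torch
import lpips

lpips_vgg = lpips.LPIPS(net="vgg")  # VGG-based perceptual metric

def render_loss(pred_patch, gt_patch, lam=1.0):
    """pred_patch, gt_patch: (G, 3, H, W) rendered / ground-truth patches in [0, 1]."""
    l_mse = torch.mean((pred_patch - gt_patch) ** 2)
    l_lpips = lpips_vgg(pred_patch * 2 - 1, gt_patch * 2 - 1).mean()
    return l_lpips + lam * l_mse
```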
And if the loss value meets the convergence condition, executing the third step to complete training.
Otherwise, if the loss value does not meet the convergence condition, executing the fourth step.
And thirdly, obtaining a dynamic human hand rendering model after training is completed.
And fourthly, updating model parameters of the machine learning model, acquiring other training images in the training set, inputting the training images into the machine learning model, and continuing training until the loss value meets the convergence condition.
Here, the inventors realized that inaccurate pose and shape parameters easily cause ambiguity. To address this, the invention sets the pose parameters and shape parameters as learnable feature embeddings during training and optimizes them automatically alongside the network; this not only yields better results but also accelerates convergence and helps produce less blur, giving more lifelike dynamic hand rendering.
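A minimal sketch of this refinement, treating the per-frame pose estimates and the shared shape estimate as learnable parameters optimized alongside the networks; the optimizer grouping and learning rates are illustrative:

```python
import torch

n_frames = 100
pose_embed = torch.nn.Parameter(torch.zeros(n_frames, 48))  # init from per-frame estimates
shape_embed = torch.nn.Parameter(torch.zeros(10))           # shared identity estimate

optimizer = torch.optim.Adam(
    [
        {"params": mlp.parameters()},    # non-rigid MLP (sketched earlier)
        {"params": field.parameters()},  # radiance field (sketched earlier)
        {"params": [pose_embed, shape_embed], "lr": 1e-4},
    ],
    lr=5e-4,
)
```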
Through the above process, by minimizing the mean square error between the rendered image output for each frame and the corresponding input training image, together with the perceptual loss that uses a VGG network as reference, the trained dynamic human hand rendering model is obtained. It achieves accurate and efficient space transformation and hand image rendering, renders high-frequency texture details through the identity-dependent neural radiance field, obtains a more accurate texture reconstruction, and thereby markedly improves the fidelity of the dynamic hand rendering result.
In an exemplary embodiment, before step 130, in the case that the rendering camera emits rays from the pixel plane of the hand image into the observation space, the embodiment of the invention samples each ray with a set step length to obtain the density distribution of the ray, then densely samples rays where the density is high and sparsely samples rays where the density is low, to obtain a plurality of sampling points.
The inventors realized that since modeling focuses on a single hand object, an assumption can be introduced to learn a more accurate neural radiance field: after a ray is sampled with the set step length and its density distribution obtained, search for the sampling points whose distance to the vertices of the corresponding hand parameterized model is greater than a set threshold, set the density of the found sampling points to a set density, and update the density distribution of the ray based on the sampling points after the density is set.
In one possible implementation, the density of the found sampling points is set to 0, and the density distribution of the ray is then updated based on the sampling points after the density setting is completed.
The specific calculation formula is as follows:
$\sigma(x_k) = 0, \quad \text{if } d(x_k) > \delta$

where $d(x_k)$ is the weighted distance from point $x_k$ to its nearest-neighbor vertices in the observation space, and $\delta$ is the set distance threshold between an observation-space ray sampling point and the vertices of the corresponding hand parameterized model.
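A minimal sketch of this density pruning, assuming the weighted distance reuses a K-nearest-vertex search like the rigid warp above; K, ε, and δ are illustrative:

```python
import torch

def prune_density(sigmas, x_obs, verts, delta=0.05, k=4, eps=1e-8):
    """Zero the density of samples far from the hand mesh.

    sigmas: (N,) densities; x_obs: (N, 3) points; verts: (V, 3) vertices.
    """
    d = torch.cdist(x_obs, verts)
    dist, _ = d.topk(k, largest=False)
    w = 1.0 / (dist + eps)
    d_weighted = (w * dist).sum(-1) / w.sum(-1)  # weighted distance d(x_k)
    return torch.where(d_weighted <= delta, sigmas, torch.zeros_like(sigmas))
```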
With the cooperation of the above embodiments, the hand is inversely transformed, in a multi-view setting, from the observation space to the standard space with pose dependency, based on the pose-dependent motion field distribution and the neural radiance field. The neural radiance field learns a learnable implicit function representation in the standard space, through which high-frequency texture details are rendered and a more accurate texture reconstruction is obtained; rendering the neural radiance field with the volume rendering method produces synthesized hand images at arbitrary viewing angles, with a pronounced effect on novel-view rendering, so the fidelity of the dynamic hand rendering result can be markedly improved.
Fig. 8 shows the effect of the three-dimensional hand reconstruction model in the embodiment of the application: the first row shows real pictures, and the second row shows the three-dimensional reconstruction results. Fig. 9 shows the effect of dynamic hand rendering in the embodiment of the application. It can be seen that the three-dimensional reconstruction results realistically preserve the texture characteristics of the hand, and a realistic synthesis effect driven by new poses is achieved.
Table 1 Quantitative comparison between the embodiment of the invention and the prior art
As shown in table 1, the embodiment of the invention is evaluated mainly on novel-view reconstruction and new-pose-driven rendering quality. The method NB (Neural Body) in table 1 is a deformable neural radiance field applied to human body reconstruction; by assuming that the neural representations learned in different frames share the same set of latent codes anchored to a deformable mesh, observations across frames can be integrated naturally. The embodiment of the invention is trained on 3 video sequences and tested on 1 video sequence, using 4 training viewing angles. As table 1 shows, the embodiment of the invention performs better on novel-view rendering and new-pose-driving tasks in terms of metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS).
The following is an embodiment of the apparatus of the present application, which may be used to execute the dynamic human hand rendering method related to the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to a method embodiment of the dynamic human hand rendering method related to the present application.
Referring to fig. 10, a dynamic human hand rendering apparatus 900 is provided in an embodiment of the present application. The apparatus 900 includes, but is not limited to: an image sequence acquisition module 910, a spatial transformation module 930, a neural radiance field learning module 950, and a human hand rendering module 970.
The image sequence obtaining module 910 is configured to obtain a human hand image sequence, where the human hand image sequence includes multiple frames of human hand images, and each frame of human hand image is obtained by shooting a human hand at different angles of view.
The space transformation module 930 is configured to transform the sampling points of the hand image in the observation space from the observation space to the standard space based on the pose-dependent motion field distribution, to obtain the coordinates of the sampling points in the standard space.
The neural radiance field learning module 950 is configured to learn the neural radiance field in the standard space according to the coordinates of the sampling points in the standard space, to obtain the volume color and volume density of the sampling points.
The human hand rendering module 970 is configured to volume render the volume color and volume density of the sampling points to obtain the rendered image of the human hand, and to differentiably render the volume density of the sampling points to obtain the three-dimensional model of the human hand.
It should be noted that, when the dynamic hand rendering device provided in the foregoing embodiment performs dynamic hand rendering, the division into the above functional modules is merely an example; in practical applications, the functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
In addition, the embodiments of the dynamic human hand rendering device and the dynamic human hand rendering method provided in the foregoing embodiments belong to the same concept, and the specific manner in which each module performs the operation has been described in detail in the method embodiment, which is not described herein again.
Fig. 11 shows a structural schematic of a server according to an exemplary embodiment. The server is suitable for use at the server side 130 in the implementation environment shown in fig. 1.
It should be noted that this server is merely one example suited to the present application and should not be construed as limiting its scope of use in any way. Nor should the server be construed as needing to rely on, or needing to include, one or more components of the exemplary server 2000 shown in fig. 11.
The hardware structure of the server 2000 may vary widely depending on configuration or performance. As shown in fig. 11, the server 2000 includes: a power supply 210, an interface 230, at least one memory 250, and at least one central processing unit (Central Processing Unit, CPU) 270.
Specifically, the power supply 210 is configured to provide an operating voltage for each hardware device on the server 2000.
The interface 230 includes at least one wired or wireless network interface 231 for interacting with external devices, for example, between the server side 130 and the acquisition side 110 in the implementation environment shown in FIG. 2.
Of course, in other adaptation examples of the present application, the interface 230 may further include at least one serial-parallel conversion interface 233, at least one input-output interface 235, at least one USB interface 237, and so on, as shown in fig. 11; this is not specifically limited herein.
The memory 250 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, where the resources stored include an operating system 251, application programs 253, and data 255, and the storage mode may be transient storage or permanent storage.
The operating system 251 is used for managing and controlling the hardware devices and application programs 253 on the server 2000, so that the central processing unit 270 can operate on and process the massive data 255 in the memory 250; it may be Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The application program 253 is a computer program that performs at least one specific task based on the operating system 251 and may include at least one module (not shown in fig. 11), each of which may in turn include a series of computer programs for the server 2000. For example, the dynamic human hand rendering apparatus may be regarded as an application program 253 deployed on the server 2000.
The data 255 may be photographs, pictures, or the like stored on a magnetic disk, or may be input image data or the like, stored in the memory 250.
The central processing unit 270 may include one or more processors and is configured to communicate with the memory 250 via at least one communication bus to read the computer programs stored in the memory 250, thereby operating on and processing the massive data 255 in the memory 250. For example, the dynamic human hand rendering method is accomplished by the central processing unit 270 reading a series of computer programs stored in the memory 250.
Furthermore, the present application can be realized by hardware circuitry or by a combination of hardware circuitry and software, and thus, the implementation of the present application is not limited to any specific hardware circuitry, software, or combination of the two.
Referring to fig. 12, in an embodiment of the present application, an electronic device 4000 is provided. The electronic device 4000 may include: desktop computers, notebook computers, servers, and the like.
In fig. 12, the electronic device 4000 includes at least one processor 4001, at least one communication bus 4002, and at least one memory 4003.
The processor 4001 is coupled to the memory 4003, for example via the communication bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as transmitting and/or receiving data. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The communication bus 4002 may include a pathway for transferring information between the aforementioned components. The communication bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus 4002 can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 12, but this does not mean there is only one bus or one type of bus.
The memory 4003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 4003 has stored thereon a computer program, and the processor 4001 reads the computer program stored in the memory 4003 through the communication bus 4002.
The computer program, when executed by the processor 4001, implements the dynamic human hand rendering method in the above embodiments.
Further, in the embodiments of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the dynamic human hand rendering method in the above embodiments.
In an embodiment of the present application, a computer program product is provided, which includes a computer program stored in a storage medium. The processor of the computer device reads the computer program from the storage medium, and the processor executes the computer program, so that the computer device executes the dynamic human hand rendering method in the above embodiments.
Compared with the related art, the present invention has the following beneficial effects:
1. The invention provides a novel dynamic human hand rendering method. By combining a neural radiance field with a parameterized human hand model in a multi-view video scene, the invention obtains more accurate texture reconstruction and makes the neural radiance field drivable: the observation space near the human hand is inversely transformed to a pose-dependent standard space, a solvable implicit function representation is learned in that standard space, and the method outputs a drivable three-dimensional hand model that can be explicitly controlled by input actions, together with high-fidelity rendered images of the hand's shape and texture. The effect is remarkable in two-dimensional rendering and three-dimensional reconstruction under novel views and novel poses, demonstrating an enhanced capability to dynamically represent different hand poses and shape changes.
2. By combining a neural radiance field with a parameterized human hand model in a multi-view video scene, the invention obtains a more accurate texture reconstruction effect and makes the neural radiance field drivable. The dynamic human hand is modeled as skeleton-following rigid deformation plus non-rigid offset compensation, and the neural radiance field is extended by introducing pose-guided deformation: through the skeletal rigid deformation field of the parameterized human hand model and a learned non-rigid deformation field, the observation space near the hand is inversely transformed to a pose-dependent standard space, where a solvable implicit function representation is learned, outputting a drivable three-dimensional hand model that can be explicitly controlled by input actions. In addition, to avoid the blurred results caused by the inability of even state-of-the-art parameterized hand model estimators to obtain accurate hand parameters, the invention proposes an analysis-by-synthesis approach that jointly optimizes the neural radiance field and the parameterized hand model parameters, which not only yields better results but also accelerates training convergence. Experiments show that the proposed method significantly outperforms the latest synthesis methods in novel view synthesis and novel pose synthesis, and exhibits strong generalization to novel pose driving.
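As a minimal sketch of the two-stage backward warp just described (a skeleton-driven rigid transform followed by a learned pose-dependent offset), the following assumes precomputed skinning weights and per-bone transforms; blending the inverse bone transforms is a common simplification, and OffsetMLP is a hypothetical stand-in for the learned non-rigid deformation field.

```python
# Sketch of the backward warp under stated assumptions: skin_weights and
# bone_transforms are precomputed from the parameterized hand model, and
# OffsetMLP is a hypothetical stand-in for the learned non-rigid field.
import torch
import torch.nn as nn

class OffsetMLP(nn.Module):
    """Pose-dependent non-rigid offset compensation."""
    def __init__(self, pose_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, pts, pose):
        pose = pose[None].expand(pts.shape[0], -1)
        return self.net(torch.cat([pts, pose], dim=-1))

def observation_to_standard(pts_obs, skin_weights, bone_transforms,
                            offset_mlp, pose):
    """pts_obs: (N, 3); skin_weights: (N, J); bone_transforms: (J, 4, 4)."""
    # Rigid part: blend per-bone inverse transforms with skinning weights
    # (a common simplification of inverting the blended forward transform).
    inv_t = torch.inverse(bone_transforms)
    blended = torch.einsum("nj,jab->nab", skin_weights, inv_t)
    homo = torch.cat([pts_obs, torch.ones_like(pts_obs[:, :1])], dim=-1)
    pts_rigid = torch.einsum("nab,nb->na", blended, homo)[:, :3]
    # Non-rigid part: learned pose-dependent offset compensation.
    return pts_rigid + offset_mlp(pts_rigid, pose)
```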
3. The invention utilizes the linear blend skinning weight field defined on the surface vertices of the parameterized human hand model, introducing a rigid-transformation prior into the hand motion, and models the non-rigidity of the hand by explicitly introducing a neural network that learns pose-dependent vertex offset compensation, so that a shape-dependent neural radiance field is optimized in a temporally unified standard space. This realizes photorealistic free-viewpoint rendering of drivable hand images; the three-dimensional mesh model of the hand can be extracted by algorithms such as marching cubes, and the decoupled representation of pose and shape allows explicit user-driven control, enabling more applications.
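A sketch of the mesh extraction step mentioned above is given below; the grid resolution, spatial bound, and density iso-level are illustrative assumptions, and density_fn stands for the density branch of the trained radiance field.

```python
# Sketch of mesh extraction from the learned density field; resolution,
# spatial bound, and iso-level are illustrative assumptions, and density_fn
# stands for the density branch of the trained radiance field.
import numpy as np
from skimage import measure

def extract_mesh(density_fn, resolution=256, bound=0.15, level=10.0):
    xs = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
    sigma = density_fn(grid.reshape(-1, 3)).reshape(
        resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(sigma, level=level)
    # marching_cubes returns vertices in voxel-index units; map to world space.
    verts = verts / (resolution - 1) * 2 * bound - bound
    return verts, faces, normals
```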
4. Compared with the prior art that uses a signed distance field as the neural volume representation, the invention uses a neural radiance field as the rendering representation, which has the capability of modeling high-frequency textures, performs better in view consistency, enables photo-realistic novel view synthesis, and reconstructs the 3D geometry of the human hand with high-quality details.
5. The invention has an explicit driving effect. It fuses the pose and shape parameters obtained from the parameterized human hand model as explicit driving parameters, enabling parameterized driving that is semantically aligned with the parameterized hand model and increasing the driving effect in the semantic space.
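As a hedged illustration of such explicit semantic driving, the snippet below edits MANO-style pose and shape vectors directly (the 48/10 dimensions follow the MANO convention and are an assumption here) and would pass them to the trained renderer, e.g. the hypothetical render_hand sketched earlier.

```python
# Illustrative explicit driving; dimensions follow the MANO convention
# (16 joints x 3 axis-angle values, 10 shape coefficients), an assumption.
import torch

pose = torch.zeros(48)     # all joints at rest
shape = torch.zeros(10)    # mean hand shape
pose[3:6] = torch.tensor([0.0, 0.6, 0.0])   # rotate one joint to drive the hand
# The edited parameters drive the canonical radiance field without retraining:
# image = render_hand(frames, pose, shape, motion_field, radiance_field, sample_rays)
```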
6. The invention has a wide application range. Augmented reality (AR) and virtual reality (VR) have the potential to become the next major computing platforms, enabling people to interact across space and time in a more immersive manner. Realistic social telepresence aims at a lifelike presence in AR and VR that is indistinguishable from reality, which imposes a fundamental requirement on the technology: every possible detail that humans express in reality must actually be conveyed. The high-fidelity reconstruction achieved by the invention can be applied to VR and AR scenes that require finely realistic modeling of the hands.
7. The present invention may provide a human hand image dataset. Because it achieves high-fidelity, novel-pose-driven hand rendering from arbitrary viewing angles, it can generate high-fidelity hand images in new poses combined with new scenes, which can serve as training data for deep learning tasks.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise a plurality of sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
The foregoing is merely illustrative of the preferred embodiments of the present invention and is not intended to limit the embodiments of the present invention. Those skilled in the art can easily make corresponding variations or modifications according to the main concept and spirit of the present invention, and therefore the protection scope of the present invention shall be defined by the claims.
Claims (10)
1. A method of dynamic human hand rendering, the method comprising:
acquiring a human hand image sequence, wherein the human hand image sequence comprises a plurality of frames of human hand images, and each frame of human hand image is obtained by photographing the human hand from a different viewing angle;
transforming sampling points of the human hand image in an observation space from the observation space to a standard space through a pose-dependent motion field distribution, to obtain sampling point coordinates of the sampling points in the standard space;
learning an identity-dependent neural radiance field in the standard space according to the sampling point coordinates of the sampling points in the standard space, to obtain volume colors and volume densities of the sampling points;
and rendering the volume colors and volume densities of the sampling points to obtain a rendering result of the human hand, wherein the rendering result comprises a rendered image of the human hand and/or a three-dimensional model of the human hand.
2. The method of claim 1, wherein the transforming the sampling points of the human hand image in the observation space from the observation space to the standard space through the pose-dependent motion field distribution to obtain the sampling point coordinates of the sampling points in the standard space comprises:
estimating the pose and shape of each frame of human hand image through a parameterized human hand model to obtain pose parameters and shape parameters of each frame of human hand image;
learning the pose-dependent motion field distribution in the observation space under the guidance of the pose parameters and the shape parameters;
and performing rigid transformation and non-rigid offset compensation, respectively, on the sampling points of the human hand image in the observation space using the pose-dependent motion field distribution, to obtain the sampling point coordinates of the sampling points in the standard space.
3. The method of claim 2, wherein rigidly transforming the sampling points of the human hand image in the observation space using the pose-dependent motion field distribution comprises:
searching, among the vertices of the parameterized human hand model, for a plurality of vertices closest to the sampling point of the human hand image in the observation space, and taking the distances between the found vertices and the sampling point as weights;
carrying out weighted summation on a preset rendering transformation matrix based on the weights to obtain a rigid transformation matrix for transforming the sampling point from the observation space to the standard space;
and rigidly transforming the sampling points according to the rigid transformation matrix.
4. The method of claim 1, wherein rendering the volume colors and volume densities of the sampling points to obtain a rendering result of the human hand comprises:
performing volume rendering on the volume colors and volume densities of the sampling points to obtain a rendered image of the human hand, and/or performing neural rendering on the volume densities of the sampling points to obtain a three-dimensional model of the human hand;
wherein the performing neural rendering on the volume densities of the sampling points to obtain a three-dimensional model of the human hand comprises:
extracting a three-dimensional mesh model from the neural radiance field through a marching cubes algorithm;
and performing differentiable rendering on the volume densities of the sampling points based on the three-dimensional mesh model to obtain the three-dimensional model of the human hand.
5. The method according to claim 1, wherein before the transforming the sampling points of the human hand image in the observation space from the observation space to the standard space through the pose-dependent motion field distribution to obtain the sampling point coordinates of the sampling points in the standard space, the method comprises:
under the condition that a rendering camera emits rays from the pixel plane of the human hand image into the observation space, sampling the rays at a set step size to obtain a density distribution of the rays;
and densely sampling rays where the density distribution is high and sparsely sampling rays where the density distribution is low, to obtain a plurality of sampling points.
6. The method of claim 5, wherein after the sampling the rays at the set step size to obtain the density distribution of the rays, the method further comprises:
searching for sampling points whose distance from the corresponding vertices of the parameterized human hand model is greater than a set threshold;
and setting the density of the found sampling points to a set density, and updating the density distribution of the rays based on the sampling points after the density setting is completed.
7. The method of any of claims 1 to 6, wherein the dynamic human hand rendering is implemented by invoking a dynamic human hand rendering model, the dynamic human hand rendering model being a trained machine learning model having dynamic human hand rendering capabilities for the human hand image.
8. The method of claim 6, wherein the training process of the dynamic human hand rendering model comprises:
acquiring a training set;
inputting training images in the training set into the machine learning model for training, to obtain a loss value between the rendering result generated in the training process and the training images;
if the loss value meets the convergence condition, the training is completed and the dynamic human hand rendering model is obtained; otherwise, updating the model parameters of the machine learning model, acquiring other training images in the training set, inputting them into the machine learning model, and continuing the training until the loss value meets the convergence condition.
9. A dynamic human hand rendering apparatus, the apparatus comprising:
an image sequence acquisition module, configured to acquire a human hand image sequence, wherein the human hand image sequence comprises a plurality of frames of human hand images, and each frame of human hand image is obtained by photographing the human hand from a different viewing angle;
a space transformation module, configured to transform sampling points of the human hand image in an observation space from the observation space to a standard space through a pose-dependent motion field distribution, to obtain sampling point coordinates of the sampling points in the standard space;
a neural radiance field learning module, configured to learn an identity-dependent neural radiance field in the standard space according to the sampling point coordinates of the sampling points in the standard space, to obtain volume colors and volume densities of the sampling points;
and a human hand rendering module, configured to render the volume colors and volume densities of the sampling points to obtain a rendering result of the human hand, wherein the rendering result comprises a rendered image of the human hand and/or a three-dimensional model of the human hand.
10. A storage medium having stored thereon a computer program which, when executed by a processor, implements the dynamic human hand rendering method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310256394.6A CN116452715A (en) | 2023-03-16 | 2023-03-16 | Dynamic human hand rendering method, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116452715A true CN116452715A (en) | 2023-07-18 |
Family
ID=87122883
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310256394.6A Pending CN116452715A (en) | 2023-03-16 | 2023-03-16 | Dynamic human hand rendering method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116452715A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116958453A (en) * | 2023-09-20 | 2023-10-27 | 成都索贝数码科技股份有限公司 | Three-dimensional model reconstruction method, device and medium based on nerve radiation field |
CN116958453B (en) * | 2023-09-20 | 2023-12-08 | 成都索贝数码科技股份有限公司 | Three-dimensional model reconstruction method, device and medium based on nerve radiation field |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |